Video NeVA

Model Introduction

Video NeVA adds support for the video modality in NeVA by representing a video as multiple image frames.

Only a minor change to the MegatronNevaModel class is needed to support pretraining on video input data.

The TarOrFolderVideoLoader class represents video input as a series of image frames, using Decord, which provides convenient methods for slicing a video into frames.
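As a rough sketch (not the actual NeMo implementation), Decord can open a video and pull a fixed number of evenly spaced frames, which are then preprocessed like ordinary input images; the file name and frame count below are placeholders:

import numpy as np
from decord import VideoReader, cpu

# Open the video on the CPU and sample 8 evenly spaced frames.
video = VideoReader("video_test.mp4", ctx=cpu(0))   # placeholder file name
indices = np.linspace(0, len(video) - 1, num=8).round().astype(int)
frames = video.get_batch(indices).asnumpy()         # shape (8, H, W, 3), dtype uint8
# Each frame can now go through the same preprocessing as a single image input.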

Video NeVA Configuration

data:
  media_type: video
  splice_single_frame: null
  num_frames: 8
  image_token_len: 256
  image_folder: null
  video_folder: null

  • media_type: If set to video, NeVA’s dataloader goes through additional preprocessing steps to represent the input video data as a series of image frames.

  • splice_single_frame: Can be set to first, middle, or last, which selects only the single frame at that position in the video.

  • image_token_len: The NeVA dataloader calculates image_token_len from the height and width of the preprocessed image frame and the patch size of the CLIP vision encoder being used. For a 224×224 frame and a patch size of 14:

image_token_len = (224 // 14) * (224 // 14) = 16 * 16 = 256

  • num_frames: The number of image frames used to represent the video (see the frame-selection sketch after this list).

  • video_folder: The directory where the video files are located. This follows the same format as NeVA’s image_folder.
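The sketch below illustrates how splice_single_frame and num_frames could translate into frame indices. It is an illustrative helper, not the NeVA dataloader itself; in particular, the assumption that num_frames evenly spaced frames are used when splice_single_frame is null is ours.

import numpy as np

def select_frame_indices(total_frames, num_frames=8, splice_single_frame=None):
    """Pick which frames of a video to keep (illustrative only)."""
    if splice_single_frame == "first":
        return [0]
    if splice_single_frame == "middle":
        return [total_frames // 2]
    if splice_single_frame == "last":
        return [total_frames - 1]
    # Assumed default: spread num_frames indices evenly across the whole video.
    return np.linspace(0, total_frames - 1, num=num_frames).round().astype(int).tolist()

print(select_frame_indices(240))                                  # 8 spread-out indices
print(select_frame_indices(240, splice_single_frame="middle"))    # [120]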

We can run neva_evaluation.py, located in NeMo/examples/multimodal/multimodal_llm/neva, to generate inference results from the Video NeVA model. Currently, Video NeVA supports both image and video inference: set the config attribute inference.media_type in NeMo/examples/multimodal/multimodal_llm/neva/conf/neva_inference.yaml to either image or video, and set the corresponding media path in inference.media_base_path.

Inference with Pretrained Projectors and a Base LM Model

An example of an inference script execution:

For running video inference:

CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python3 /path/to/neva_evaluation.py \
    --config-path=/path/to/conf/ \
    --config-name=neva_inference.yaml \
    tensor_model_parallel_size=4 \
    pipeline_model_parallel_size=1 \
    neva_model_file=/path/to/projector/checkpoint \
    base_model_file=/path/to/base/lm/checkpoint \
    trainer.devices=4 \
    trainer.precision=bf16 \
    prompt_file=/path/to/prompt/file \
    inference.media_base_path=/path/to/videos \
    inference.media_type=video \
    output_file=/path/for/output/file/ \
    inference.temperature=0.2 \
    inference.top_k=0 \
    inference.top_p=0.9 \
    inference.greedy=False \
    inference.add_BOS=False \
    inference.all_probs=False \
    inference.repetition_penalty=1.2 \
    inference.insert_media_token=right \
    inference.tokens_to_generate=256 \
    quantization.algorithm=awq \
    quantization.enable=False

Example format of .jsonl prompt_file:

{"video": "video_test.mp4", "text": "Can you describe the scene?", "category": "conv", "question_id": 0}

input video file: video_test.mp4
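If it helps, a prompt file in this format can be written with a few lines of Python; the output name prompts.jsonl is an arbitrary choice and would be passed via prompt_file:

import json

# One prompt per video; the keys mirror the example entry above.
prompts = [
    {"video": "video_test.mp4", "text": "Can you describe the scene?",
     "category": "conv", "question_id": 0},
]

with open("prompts.jsonl", "w") as f:
    for entry in prompts:
        f.write(json.dumps(entry) + "\n")   # one JSON object per line (.jsonl)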

Output:

<extra_id_0>System
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
<extra_id_1>User
Can you describe the scene?<video>
<extra_id_1>Assistant
<extra_id_2>quality:4,toxicity:0,humor:0,creativity:0,helpfulness:4,correctness:4,coherence:4,complexity:4,verbosity:4
CLEAN RESPONSE: Hand with a robot arm

Inference with Finetuned Video NeVA Model (No Need to Specify Base LM)

An example of an inference script execution:

For running video inference:

CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python3 /path/to/neva_evaluation.py \
    --config-path=/path/to/conf/ \
    --config-name=neva_inference.yaml \
    tensor_model_parallel_size=4 \
    pipeline_model_parallel_size=1 \
    neva_model_file=/path/to/video/neva/model \
    trainer.devices=4 \
    trainer.precision=bf16 \
    prompt_file=/path/to/prompt/file \
    inference.media_base_path=/path/to/videos \
    inference.media_type=video \
    output_file=/path/for/output/file/ \
    inference.temperature=0.2 \
    inference.top_k=0 \
    inference.top_p=0.9 \
    inference.greedy=False \
    inference.add_BOS=False \
    inference.all_probs=False \
    inference.repetition_penalty=1.2 \
    inference.insert_media_token=right \
    inference.tokens_to_generate=256 \
    quantization.algorithm=awq \
    quantization.enable=False

We can run mixtral_eval.py, located in NeMo/examples/multimodal/multimodal_llm/neva/eval, to call the Mixtral API and score the generated responses of two models. Here we use llava-bench-in-the-wild as an example.

Setup

Before running the script, we need to set up an NGC API key for calling the foundation models on NVIDIA NGC. Once you have set up your account on NGC, log in, go to the API key page, and click Get API Key. Save the key.

Download dataset

We first download the llava-bench-in-the-wild dataset:

git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild

Also download the rule.json file.

Note that each answer file in llava-bench-in-the-wild consists of rows of JSON strings:

{"question_id": 0, "prompt": "What is the name of this famous sight in the photo?", "answer_id": "TeyehNxHw5j8naXfEWaxWd", "model_id": "gpt-4-0314", "metadata": {}, "text": "The famous sight in the photo is Diamond Head."}

You may also provide your own response file in the following format:

{"response_id": 0, "response": "The famous sight in the photo is Diamond Head."}

Both formats are accepted.
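For reference, a small reader that accepts either format might look like the sketch below; the helper name and the choice to key responses by question_id / response_id are ours, not part of mixtral_eval.py.

import json

def load_responses(path):
    """Map question_id / response_id to the reply text, accepting both formats above."""
    responses = {}
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            key = entry.get("question_id", entry.get("response_id"))
            responses[key] = entry.get("text", entry.get("response"))
    return responses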

Evaluation

Install the required package:

pip install shortuuid

Now you can run the script as follows:

API_TOKEN=nvapi-<the api you just saved> python3 NeMo/examples/multimodal/multimodal_llm/neva/eval/mixtral_eval.py \
    --model-name-list gpt bard \
    --media-type image \
    --question-file llava-bench-in-the-wild/questions.jsonl \
    --responses-list llava-bench-in-the-wild/answers_gpt4.jsonl llava-bench-in-the-wild/bard_0718.jsonl \
    --answers-dir ./ \
    --context-file llava-bench-in-the-wild/context.jsonl \
    --output ./output.json

Here --question-file is the question file, --responses-list takes the two answer/response files, --answers-dir is where the answers are saved, --context-file is the context file, and --output is where the generated Mixtral reviews for the two models are written.

You’ll see results like:

all                  84.8  72.4
llava_bench_complex  77.0  69.0
llava_bench_conv     91.8  77.1
llava_bench_detail   91.3  73.2

Note that when you start a new comparison, you should remove the existing output.json file.
