Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Video NeVA
Model Introduction
Video NeVA adds support for the video modality in NeVA by representing a video as multiple image frames.
Only a minor change to the MegatronNevaModel class is needed to support pretraining on video input data.
The conversion of video input into a series of image frames is handled by the TarOrFolderVideoLoader class, which uses Decord to provide convenient video slicing.
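As a rough illustration, the snippet below is a minimal sketch (not the actual TarOrFolderVideoLoader code) of how Decord can sample a fixed number of evenly spaced frames from a video; the sample_frames helper and the example file name are illustrative only and assume Decord is installed.

# Minimal sketch, assuming a local .mp4 file; not the NeMo dataloader itself.
import numpy as np
from decord import VideoReader, cpu


def sample_frames(video_path, num_frames=8):
    """Return `num_frames` evenly spaced RGB frames as a (N, H, W, 3) uint8 array."""
    vr = VideoReader(video_path, ctx=cpu(0))              # decode on CPU
    indices = np.linspace(0, len(vr) - 1, num_frames)     # evenly spaced frame indices
    return vr.get_batch(indices.astype(int).tolist()).asnumpy()  # batched random access


frames = sample_frames("video_test.mp4", num_frames=8)
print(frames.shape)  # e.g. (8, 720, 1280, 3)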
Video NeVA Configuration
data:
  media_type: video
  splice_single_frame: null
  num_frames: 8
  image_token_len: 256
  image_folder: null
  video_folder: null
media_type: If set to video, NeVA's dataloader goes through additional preprocessing steps to represent the input video data as a series of image frames.

splice_single_frame: Can be set to first, middle, or last. Only the single frame at that position in the video is selected.

image_token_len: The NeVA dataloader calculates image_token_len based on the height and width of the preprocessed image frame and the patch size of the CLIP model being used. For example, with 224x224 frames and a patch size of 14:

image_token_len = (224 // 14) * (224 // 14) = 16 * 16 = 256

num_frames: The number of image frames used to represent the video.

video_folder: The directory where the video files are located. This follows the same format as NeVA's image_folder.
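As a quick sanity check of this arithmetic, here is a small illustrative Python snippet; video_token_len is not a NeVA config option and only assumes that each frame contributes image_token_len tokens:

# Illustrative arithmetic only; values taken from the example config above.
crop_size = 224   # height/width of the preprocessed frame (CLIP input size)
patch_size = 14   # CLIP ViT patch size
num_frames = 8    # data.num_frames from the config above

image_token_len = (crop_size // patch_size) * (crop_size // patch_size)  # 16 * 16 = 256
video_token_len = num_frames * image_token_len                           # 8 * 256 = 2048

print(image_token_len, video_token_len)  # 256 2048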
Inference with Video NeVA
We can run neva_evaluation.py, located in NeMo/examples/multimodal/multimodal_llm/neva, to generate inference results from the Video NeVA model.
Currently, Video NeVA supports both image and video inference. To choose between them, set the config attribute inference.media_type in NeMo/examples/multimodal/multimodal_llm/neva/conf/neva_inference.yaml to either image or video, and provide the corresponding media path in inference.media_base_path.
Inference with Pretrained Projectors and a Base LM Model
An example of an inference script execution:
For running video inference:
CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python3 /path/to/neva_evaluation.py \
--config-path=/path/to/conf/ \
--config-name=neva_inference.yaml \
tensor_model_parallel_size=4 \
pipeline_model_parallel_size=1 \
neva_model_file=/path/to/projector/checkpoint \
base_model_file=/path/to/base/lm/checkpoint \
trainer.devices=4 \
trainer.precision=bf16 \
prompt_file=/path/to/prompt/file \
inference.media_base_path=/path/to/videos \
inference.media_type=video \
output_file=/path/for/output/file/ \
inference.temperature=0.2 \
inference.top_k=0 \
inference.top_p=0.9 \
inference.greedy=False \
inference.add_BOS=False \
inference.all_probs=False \
inference.repetition_penalty=1.2 \
inference.insert_media_token=right \
inference.tokens_to_generate=256 \
quantization.algorithm=awq \
quantization.enable=False
Example format of the .jsonl prompt_file:
{"video": "video_test.mp4", "text": "Can you describe the scene?", "category": "conv", "question_id": 0}
input video file: video_test.mp4
Output:
<extra_id_0>System
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
<extra_id_1>User
Can you describe the scene?<video>
<extra_id_1>Assistant
<extra_id_2>quality:4,toxicity:0,humor:0,creativity:0,helpfulness:4,correctness:4,coherence:4,complexity:4,verbosity:4
CLEAN RESPONSE: Hand with a robot arm
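If you want to build such a prompt file programmatically, a minimal sketch is shown below; the helper and the prompts.jsonl filename are illustrative, not part of NeMo.

# Hypothetical helper: write a prompt_file in the .jsonl format shown above,
# one JSON object per line.
import json

prompts = [
    {"video": "video_test.mp4", "text": "Can you describe the scene?",
     "category": "conv", "question_id": 0},
]

with open("prompts.jsonl", "w") as f:
    for prompt in prompts:
        f.write(json.dumps(prompt) + "\n")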
Inference with Finetuned Video NeVA Model (No Need to Specify Base LM)
An example of an inference script execution:
For running video inference:
CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python3 /path/to/neva_evaluation.py \
--config-path=/path/to/conf/ \
--config-name=neva_inference.yaml \
tensor_model_parallel_size=4 \
pipeline_model_parallel_size=1 \
neva_model_file=/path/to/video/neva/model \
trainer.devices=4 \
trainer.precision=bf16 \
prompt_file=/path/to/prompt/file \
inference.media_base_path=/path/to/videos \
inference.media_type=video \
output_file=/path/for/output/file/ \
inference.temperature=0.2 \
inference.top_k=0 \
inference.top_p=0.9 \
inference.greedy=False \
inference.add_BOS=False \
inference.all_probs=False \
inference.repetition_penalty=1.2 \
inference.insert_media_token=right \
inference.tokens_to_generate=256 \
quantization.algorithm=awq \
quantization.enable=False
Evaluation with Mixtral as a Judge
We can run mixtral_eval.py, located in NeMo/examples/multimodal/multimodal_llm/neva, to call the Mixtral API and score the generated responses of two models.
Here we use llava-bench-in-the-wild as an example.
Set up
Before running the script, we need to set up an NGC API key for calling the foundation models on NVIDIA NGC. Once you have set up your account on NGC, log in, go to the API key page, and click Get API Key. Save the key.
Download dataset
We first download the llava-bench-in-the-wild dataset:
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
Also download the rule.json file.
Note that the answer file in llava-bench-in-the-wild consists of rows of JSON strings:
{"question_id": 0, "prompt": "What is the name of this famous sight in the photo?", "answer_id": "TeyehNxHw5j8naXfEWaxWd", "model_id": "gpt-4-0314", "metadata": {}, "text": "The famous sight in the photo is Diamond Head."}
You may also provide your own response file in the form:
{"response_id": 0, "response": "The famous sight in the photo is Diamond Head."}
Both formats are accepted.
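If you prefer the simpler format, a minimal conversion sketch is shown below; it is not shipped with NeMo, and the output filename is illustrative.

# Hypothetical conversion script: turn a llava-bench-in-the-wild style answer
# file into the simpler {"response_id", "response"} format shown above.
import json

src_path = "llava-bench-in-the-wild/answers_gpt4.jsonl"
dst_path = "responses_gpt4.jsonl"  # illustrative output filename

with open(src_path) as src, open(dst_path, "w") as dst:
    for line in src:
        row = json.loads(line)
        record = {"response_id": row["question_id"], "response": row["text"]}
        dst.write(json.dumps(record) + "\n")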
Evaluation
Install the required package:
pip install shortuuid
Now you can run the script simply by:
API_TOKEN=nvapi-<the API key you just saved> python3 NeMo/examples/multimodal/multimodal_llm/neva/eval/mixtral_eval.py \
--model-name-list gpt bard --media-type image \
--question-file llava-bench-in-the-wild/questions.jsonl \
--responses-list llava-bench-in-the-wild/answers_gpt4.jsonl llava-bench-in-the-wild/bard_0718.jsonl \
--answers-dir ./ \
--context-file llava-bench-in-the-wild/context.jsonl \
--output ./output.json
Here, --question-file is the question file, --responses-list takes the two answer/response files to compare, --answers-dir is the directory where the answers are saved, --context-file is the context file, and --output is where the generated Mixtral reviews for the two models are written.
You’ll see results like:
all 84.8 72.4
llava_bench_complex 77.0 69.0
llava_bench_conv 91.8 77.1
llava_bench_detail 91.3 73.2
Note that when you start a new comparison, you should remove the existing output.json file.