VideoNeVA
VideoNeVA adds support for the video modality in NeVa by representing a video as a sequence of image frames.
To enable pretraining on video input data, a minor modification has been made to the MegatronNevaModel class within the nemo.collections.multimodal.models.multimodal_llm.neva.neva_model module.
Representing video input as a series of images is handled by the TarOrFolderVideoLoader class in the nemo.collections.multimodal.data.neva
module. This process utilizes Decord, which offers convenient video slicing methods.
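The frame-selection step can be sketched as follows. Note this is an illustrative sketch, not the actual TarOrFolderVideoLoader code: `sample_frame_indices` is a hypothetical helper, and the commented Decord call only shows how such indices would typically be consumed.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` indices spread evenly across the video.

    If the video is shorter than `num_frames`, return every frame.
    """
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    # Take the middle frame of each of the `num_frames` equal segments.
    return [int(step * i + step / 2) for i in range(num_frames)]


# Sketch of how the indices could be fed to Decord (illustration only,
# not the exact NeVa implementation):
#   from decord import VideoReader
#   vr = VideoReader("clip.mp4")
#   frames = vr.get_batch(sample_frame_indices(len(vr), 8))  # (8, H, W, 3)
```

Evenly spaced sampling like this keeps the selected frames representative of the whole clip regardless of its length.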
Configure VideoNeVA
```yaml
data:
  media_type: video
  splice_single_frame: null
  num_frames: 8
  image_token_len: 256
  image_folder: null
  video_folder: null
```
media_type
: When set to video, NeVa's dataloader performs additional preprocessing steps to represent the input video data as a series of image frames.

splice_single_frame
: This parameter can be set to first, middle, or last. It determines which specific frame within the video will be selected.

image_token_len
: The NeVa dataloader calculates image_token_len from the height and width of the preprocessed image frame and the patch size of the CLIP model being used. For a 224x224 frame and a patch size of 14:
image_token_len = (224 // 14) * (224 // 14) = 16 * 16 = 256

num_frames
: This parameter selects the number of image frames used to represent the video.

video_folder
: This parameter specifies the directory where the video files are located. It follows the same format as NeVa's image_folder.
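The image_token_len arithmetic above can be reproduced directly. The 224 crop size and patch size of 14 are the values used in the formula above; `image_token_len` below is an illustrative helper, not a NeMo API:

```python
def image_token_len(crop_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square crop:
    (crop_size // patch_size) patches per side, squared."""
    patches_per_side = crop_size // patch_size
    return patches_per_side * patches_per_side


print(image_token_len(224, 14))  # 256, matching the config above
```

Since each frame contributes image_token_len tokens, a video represented by num_frames frames occupies num_frames * image_token_len positions in the language model's context.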
| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | |
| SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | No | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |