Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

VideoNeVA

VideoNeVA adds support for the video modality in NeVa by representing each video as a sequence of image frames.

To enable pretraining on video input data, a minor modification has been made to the MegatronNevaModel class within the nemo.collections.multimodal.models.multimodal_llm.neva.neva_model module.

Representing video input as a series of images is handled by the TarOrFolderVideoLoader class in the nemo.collections.multimodal.data.neva module. This process utilizes Decord, which offers convenient video slicing methods.
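As a rough illustration of the idea, the sketch below picks evenly spaced frame indices and decodes them with Decord. It is not the TarOrFolderVideoLoader implementation: the helper names uniform_frame_indices and load_frames, and the choice of even spacing, are assumptions made for this example.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return up to num_frames evenly spaced indices in [0, total_frames - 1]."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Never ask for more frames than the video contains.
    n = min(total_frames, num_frames)
    if n == 1:
        return [0]
    # Integer arithmetic makes the first and last frames exact endpoints.
    return [(i * (total_frames - 1)) // (n - 1) for i in range(n)]


def load_frames(video_path: str, num_frames: int):
    """Decode the sampled frames into an (N, H, W, 3) NumPy array."""
    # Imported lazily so the index helper works without Decord installed.
    from decord import VideoReader, cpu

    reader = VideoReader(video_path, ctx=cpu(0))
    indices = uniform_frame_indices(len(reader), num_frames)
    return reader.get_batch(indices).asnumpy()
```

Keeping the index computation separate from decoding makes the sampling policy easy to check without a video file.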

Configure VideoNeVA

data:
  media_type: video
  splice_single_frame: null
  num_frames: 8
  image_token_len: 256
  image_folder: null
  video_folder: null

media_type: When set to video, NeVa’s dataloader performs additional preprocessing steps to represent the input video data as a series of image frames.
splice_single_frame: This parameter can be set to first, middle, or last. It determines which single frame of the video is selected.
image_token_len: The NeVa dataloader calculates image_token_len from the height and width of the preprocessed image frame and the patch size of the CLIP model being used. For example, a 224 x 224 frame with a patch size of 14 gives:

image_token_len = (224 // 14) * (224 // 14) = 16 * 16 = 256
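The same arithmetic can be written as a small helper, which is handy when trying other frame sizes or patch sizes. This is a minimal sketch; the function name and signature are hypothetical and not part of the NeMo API.

```python
def image_token_len(height: int, width: int, patch_size: int) -> int:
    """Number of patch tokens for one frame: one token per full patch."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    # Floor division drops any partial patch along each dimension.
    return (height // patch_size) * (width // patch_size)


print(image_token_len(224, 224, 14))  # 256, matching the value above
```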

num_frames: This parameter sets the number of image frames used to represent the video.
video_folder: This parameter specifies the directory where the video files are located. This follows the same format as NeVa’s image_folder.
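How a splice_single_frame value might translate into a concrete frame index can be sketched as follows. The mapping shown here (in particular, treating middle as total_frames // 2) and the helper name single_frame_index are assumptions for illustration; the loader source is the authority on the exact behavior.

```python
from typing import Optional


def single_frame_index(total_frames: int, splice_single_frame: Optional[str]) -> Optional[int]:
    """Map a splice_single_frame setting to a frame index, or None if unset."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if splice_single_frame is None:
        return None  # no single-frame selection requested
    choices = {
        "first": 0,
        "middle": total_frames // 2,  # assumed definition of "middle"
        "last": total_frames - 1,
    }
    if splice_single_frame not in choices:
        raise ValueError(f"unsupported splice_single_frame: {splice_single_frame!r}")
    return choices[splice_single_frame]
```

Rejecting unknown values early surfaces a typo in the config immediately rather than letting it silently change which frames are used.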

Feature	Training	Inference
Data parallelism	Yes	N/A
Tensor parallelism	Yes	Yes
Pipeline parallelism	No	No
Sequence parallelism	No	No
Activation checkpointing	Yes (Uniform or Block)	No
FP32/TF32	Yes	Yes (FP16 enabled by default)
AMP/FP16	No	Yes
AMP/BF16	Yes	No
BF16 O2	Yes	No
TransformerEngine/FP8	No	No
Multi-GPU	Yes	Yes
Multi-Node	Yes	Yes
Inference deployment	N/A	NVIDIA Triton supported
SW stack support	Slurm DeepOps/Base Command Manager/Base Command Platform	Slurm DeepOps/Base Command Manager/Base Command Platform
NVfuser	No	N/A
Distributed Optimizer	No	N/A
TorchInductor	No	N/A
Flash Attention	Yes	N/A