VideoNeVA

VideoNeVA adds support for video modality in NeVa by representing video as multiple image frames.

To enable pretraining on video input data, a minor modification has been made to the MegatronNevaModel class within the nemo.collections.multimodal.models.multimodal_llm.neva.neva_model module.

The TarOrFolderVideoLoader class in the nemo.collections.multimodal.data.neva module converts video input into a series of images. It uses Decord, which provides convenient video slicing methods.
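The frame extraction described above can be sketched roughly as follows. This is a minimal illustration and not the actual TarOrFolderVideoLoader code: the helper names uniform_frame_indices and load_video_frames are hypothetical, and evenly spaced sampling is an assumption about how frames are chosen. Only the Decord calls (VideoReader, len, get_batch, asnumpy) belong to the real library.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick num_frames indices spread evenly over [0, total_frames - 1].

    Hypothetical helper: even spacing is an assumption, not a confirmed
    detail of the NeMo loader. If num_frames exceeds total_frames, some
    indices repeat.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [int(round(i * step)) for i in range(num_frames)]


def load_video_frames(path: str, num_frames: int = 8):
    """Decode num_frames evenly spaced frames from a video file with Decord."""
    # Imported lazily so the index helper works without Decord installed.
    from decord import VideoReader

    reader = VideoReader(path)
    indices = uniform_frame_indices(len(reader), num_frames)
    # get_batch returns the selected frames stacked as
    # (num_frames, height, width, channels).
    return reader.get_batch(indices).asnumpy()
```

Each returned frame can then be preprocessed like an ordinary image, which is what allows the rest of the NeVa pipeline to stay largely unchanged.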

Configure VideoNeVA

data:
  media_type: video
  splice_single_frame: null
  num_frames: 8
  image_token_len: 256
  image_folder: null
  video_folder: null

  • media_type: When set to video, NeVa’s dataloader performs additional preprocessing steps to represent the input video data as a series of image frames.

  • splice_single_frame: This parameter can be set to first, middle, or last. It determines which single frame of the video is selected.

  • image_token_len: The NeVa dataloader calculates image_token_len from the height and width of the preprocessed image frame and the patch size of the CLIP model being used. For example, a 224 x 224 frame with a patch size of 14 gives:

image_token_len = (224 // 14) * (224 // 14) = 16 * 16 = 256
  • num_frames: The number of image frames used to represent the video.

  • video_folder: The directory where the video files are located. It follows the same format as NeVa’s image_folder.
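As a rough illustration of the parameters above, the sketch below reproduces the image_token_len arithmetic and shows one way a splice_single_frame value could map to a frame index. Both function names are hypothetical. The choice of total_frames // 2 for middle, and the treatment of null (None) as "no single frame selected", are assumptions rather than documented NeMo behavior.

```python
from typing import Optional


def image_token_len(height: int, width: int, patch_size: int) -> int:
    """Patch tokens per frame: (height // patch_size) * (width // patch_size)."""
    return (height // patch_size) * (width // patch_size)


def single_frame_index(total_frames: int, splice_single_frame: Optional[str]) -> Optional[int]:
    """Map a splice_single_frame setting to a frame index (hypothetical helper).

    Returns None when no single frame is requested (assumed meaning of null).
    """
    if splice_single_frame is None:
        return None
    choices = {
        "first": 0,
        "middle": total_frames // 2,  # assumed definition of the middle frame
        "last": total_frames - 1,
    }
    if splice_single_frame not in choices:
        raise ValueError("splice_single_frame must be first, middle, last, or null")
    return choices[splice_single_frame]


# A 224 x 224 frame with a patch size of 14 yields 256 tokens.
tokens_per_frame = image_token_len(224, 224, 14)
```

Keeping the token calculation separate from frame selection mirrors the configuration, where image_token_len depends only on the frame size and patch size, not on how many frames are sampled.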

Feature                   Training                                                  Inference
Data parallelism          Yes                                                       N/A
Tensor parallelism        Yes                                                       Yes
Pipeline parallelism      No                                                        No
Sequence parallelism      No                                                        No
Activation checkpointing  Yes (Uniform or Block)                                    No
FP32/TF32                 Yes                                                       Yes (FP16 enabled by default)
AMP/FP16                  No                                                        Yes
AMP/BF16                  Yes                                                       No
BF16 O2                   Yes                                                       No
TransformerEngine/FP8     No                                                        No
Multi-GPU                 Yes                                                       Yes
Multi-Node                Yes                                                       Yes
Inference deployment      N/A                                                       NVIDIA Triton supported
SW stack support          Slurm DeepOps/Base Command Manager/Base Command Platform  Slurm DeepOps/Base Command Manager/Base Command Platform
NVfuser                   No                                                        N/A
Distributed Optimizer     No                                                        N/A
TorchInductor             No                                                        N/A
Flash Attention           Yes                                                       N/A