Important

NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

VideoNeVA

VideoNeVA adds support for the video modality in NeVa by representing a video as a sequence of image frames.

To enable pretraining on video input data, a minor modification has been made to the MegatronNevaModel class within the nemo.collections.multimodal.models.multimodal_llm.neva.neva_model module.

Representing video input as a series of images is handled by the TarOrFolderVideoLoader class in the nemo.collections.multimodal.data.neva module. This process utilizes Decord, which offers convenient video slicing methods.
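For illustration, the snippet below shows how Decord can slice a video into a fixed number of evenly spaced frames. This is a minimal sketch of the general approach, not the TarOrFolderVideoLoader implementation; the function name and the even-spacing strategy are assumptions.

import numpy as np
from decord import VideoReader

def load_video_frames(video_path: str, num_frames: int = 8) -> np.ndarray:
    """Return num_frames evenly spaced RGB frames as a (T, H, W, C) array."""
    # Hypothetical helper for illustration; not part of the NeMo API.
    vr = VideoReader(video_path)
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(indices).asnumpy()

Each returned frame can then be preprocessed like a still image before being passed to the vision encoder.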

Configure VideoNeVA

data:
  media_type: video
  splice_single_frame: null
  num_frames: 8
  image_token_len: 256
  image_folder: null
  video_folder: null
  • media_type: When set to video, NeVa’s dataloader performs additional preprocessing steps to represent the input video data as a series of image frames.

  • splice_single_frame: This parameter can be set to first, middle, or last, and determines which single frame within the video will be selected. If left as null (the default above), num_frames frames are sampled from the video instead; see the sampling sketch after this list.

  • image_token_len: The NeVa dataloader calculates image_token_len from the height and width of the preprocessed image frame and the patch size of the CLIP model in use. For example, with 224×224 frames and a patch size of 14:

image_token_len = (224 // 14) * (224 // 14) = 16 * 16 = 256
  • num_frames: This parameter sets the number of image frames used to represent the video.

  • video_folder: This parameter specifies the directory where the video files are located. This follows the same format as NeVa’s image_folder.
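The combined effect of splice_single_frame and num_frames can be summarized with a small index-selection helper. This is an assumed reading of the sampling behavior, shown for illustration only; the function below is hypothetical and is not the actual NeVa dataloader code.

def select_frame_indices(total_frames: int, num_frames: int,
                         splice_single_frame: str | None) -> list[int]:
    # Assumption: a non-null splice_single_frame selects exactly one frame.
    if splice_single_frame == "first":
        return [0]
    if splice_single_frame == "middle":
        return [total_frames // 2]
    if splice_single_frame == "last":
        return [total_frames - 1]
    # Assumption: null spreads num_frames indices evenly across the video.
    step = max(total_frames // num_frames, 1)
    return list(range(0, total_frames, step))[:num_frames]

For example, a 120-frame clip with num_frames=8 and splice_single_frame=null would yield indices [0, 15, 30, 45, 60, 75, 90, 105], while splice_single_frame=middle would yield [60].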

| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | NVIDIA Triton supported |
| SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | No | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |