VideoNeVA adds support for video modality in NeVa by representing video as multiple image frames.
To enable pretraining on video input data, the MegatronNevaModel class in the nemo.collections.multimodal.models.multimodal_llm.neva.neva_model module has been slightly modified.
The TarOrFolderVideoLoader class in the nemo.collections.multimodal.data.neva module converts video input into a series of images. It relies on Decord, which offers convenient video slicing methods.
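As a rough illustration of how a video can be reduced to a handful of frames, the sketch below picks frame indices and decodes them with Decord. It is not the TarOrFolderVideoLoader implementation: the helper names `select_frame_indices` and `load_frames` are hypothetical, and the evenly spaced sampling used in multi-frame mode is an assumption. The arguments `num_frames` and `splice_single_frame` mirror the data configuration keys of the same names.

```python
from typing import List, Optional


def select_frame_indices(
    total_frames: int,
    num_frames: int = 8,
    splice_single_frame: Optional[str] = None,
) -> List[int]:
    """Choose which frame indices represent a video (hypothetical helper)."""
    if total_frames <= 0:
        raise ValueError("video has no frames")
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")

    if splice_single_frame is not None:
        # Single-frame mode: keep only the first, middle, or last frame.
        positions = {
            "first": 0,
            "middle": total_frames // 2,
            "last": total_frames - 1,
        }
        if splice_single_frame not in positions:
            raise ValueError("expected 'first', 'middle', or 'last'")
        return [positions[splice_single_frame]]

    # Multi-frame mode: evenly spaced indices from the first to the last frame.
    # This sampling strategy is an assumption made for the sketch.
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


def load_frames(video_path: str, num_frames: int = 8,
                splice_single_frame: Optional[str] = None):
    """Decode the selected frames with Decord as an array of shape (N, H, W, 3)."""
    # Imported lazily so the index helper above works without Decord installed.
    from decord import VideoReader, cpu

    reader = VideoReader(video_path, ctx=cpu(0))
    indices = select_frame_indices(len(reader), num_frames, splice_single_frame)
    return reader.get_batch(indices).asnumpy()
```

For a 100-frame clip, `select_frame_indices(100, 8)` returns eight indices spread from frame 0 to frame 99, while `select_frame_indices(100, splice_single_frame="middle")` returns only frame 50.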
data:
  media_type: video
  splice_single_frame: null
  num_frames: 8
  image_token_len: 256
  image_folder: null
  video_folder: null
media_type
: When set to video, NeVa’s dataloader performs additional preprocessing steps to represent the input video data as a series of image frames.

splice_single_frame
: Can be set to first, middle, or last. It determines which single frame of the video is selected.

image_token_len
: The NeVa dataloader calculates image_token_len from the height and width of the preprocessed image frame and the patch size of the CLIP model being used. For example, for a 224 x 224 frame and a patch size of 14:

    image_token_len = (224 // 14) * (224 // 14) = 16 * 16 = 256

num_frames
: Sets the number of image frames used to represent the video.

video_folder
: Specifies the directory where the video files are located. It follows the same format as NeVa’s image_folder.
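The image_token_len arithmetic can be checked with a few lines of Python. This is a minimal sketch of the formula as stated here, not the dataloader's own code; the function name `compute_image_token_len` is hypothetical.

```python
def compute_image_token_len(height: int, width: int, patch_size: int) -> int:
    """Number of patch tokens for one preprocessed frame (hypothetical helper)."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    # Integer division counts the whole patches that fit along each axis.
    return (height // patch_size) * (width // patch_size)


# 224 x 224 frame with a patch size of 14, as in the example configuration.
tokens = compute_image_token_len(224, 224, 14)
print(tokens)  # 256
```

The same helper gives 576 for a 336 x 336 frame with a patch size of 14, so image_token_len has to be updated whenever the vision encoder's input resolution or patch size changes.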
Feature | Training | Inference |
---|---|---|
Data parallelism | Yes | N/A |
Tensor parallelism | Yes | Yes |
Pipeline parallelism | No | No |
Sequence parallelism | No | No |
Activation checkpointing | Yes (Uniform or Block) | No |
FP32/TF32 | Yes | Yes (FP16 enabled by default) |
AMP/FP16 | No | Yes |
AMP/BF16 | Yes | No |
BF16 O2 | Yes | No |
TransformerEngine/FP8 | No | No |
Multi-GPU | Yes | Yes |
Multi-Node | Yes | Yes |
Inference deployment | N/A | NVIDIA Triton supported |
SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform |
NVfuser | No | N/A |
Distributed Optimizer | No | N/A |
TorchInductor | No | N/A |
Flash Attention | Yes | N/A |