VideoNeVA
VideoNeVA adds support for the video modality in NeVa by representing a video as a sequence of image frames.
To enable pretraining on video input data, a minor modification has been made to the MegatronNevaModel class within the nemo.collections.multimodal.models.multimodal_llm.neva.neva_model module.
Representing video input as a series of images is handled by the TarOrFolderVideoLoader class in the nemo.collections.multimodal.data.neva
module. This process utilizes Decord, which offers convenient video slicing methods.
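The frame-selection step can be sketched as follows. Note this is an illustrative sketch, not the actual TarOrFolderVideoLoader code: `sample_frame_indices` is a hypothetical helper, and the commented Decord call only shows how such indices would typically be consumed.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` indices spread evenly across the video.

    If the video is shorter than `num_frames`, return every frame.
    """
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    # Take the middle frame of each of the `num_frames` equal segments.
    return [int(step * i + step / 2) for i in range(num_frames)]


# Sketch of how the indices could be fed to Decord (illustration only,
# not the exact NeVa implementation):
#   from decord import VideoReader
#   vr = VideoReader("clip.mp4")
#   frames = vr.get_batch(sample_frame_indices(len(vr), 8))  # (8, H, W, 3)
```

Evenly spaced sampling like this keeps the selected frames representative of the whole clip regardless of its length.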
Configure VideoNeVA
```yaml
data:
  media_type: video
  splice_single_frame: null
  num_frames: 8
  image_token_len: 256
  image_folder: null
  video_folder: null
```
media_type
: When set to video, NeVa's dataloader performs additional preprocessing steps to represent the input video data as a series of image frames.

splice_single_frame
: This parameter can be set to first, middle, or last. It determines which specific frame within the video will be selected.

image_token_len
: The NeVa dataloader calculates image_token_len from the height and width of the preprocessed image frame and the patch size of the CLIP model being used. For a 224x224 frame and a patch size of 14:
image_token_len = (224 // 14) * (224 // 14) = 16 * 16 = 256

num_frames
: This parameter selects the number of image frames used to represent the video.

video_folder
: This parameter specifies the directory where the video files are located. It follows the same format as NeVa's image_folder.
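The image_token_len arithmetic above can be reproduced directly. The 224 crop size and patch size of 14 are the values used in the formula above; `image_token_len` below is an illustrative helper, not a NeMo API:

```python
def image_token_len(crop_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square crop:
    (crop_size // patch_size) patches per side, squared."""
    patches_per_side = crop_size // patch_size
    return patches_per_side * patches_per_side


print(image_token_len(224, 14))  # 256, matching the config above
```

Since each frame contributes image_token_len tokens, a video represented by num_frames frames occupies num_frames * image_token_len positions in the language model's context.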
| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | |
| SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | No | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |