Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

NeVA (LLaVA)

Originating from LLaVA (Large Language and Vision Assistant), NeVA is a groundbreaking addition to the NeMo Multimodal ecosystem. The model integrates a large language model (such as NVGPT or Llama 2) with a vision encoder and is trained on machine-generated multimodal language-image instruction-following data. Building on the foundation set by LLaVA, NeVA further enhances training by leveraging features of the NeMo LLM framework such as model parallelism, activation checkpointing, AMP O2, and Flash Attention. While traditional language models focus primarily on text, NeVA takes a holistic approach that bridges visual and linguistic comprehension.

New in NeVA

Here are some of the new features introduced in NeVA.

  1. Pipeline Parallelism Support: NeVA now supports pipeline parallelism. When pipeline parallelism is employed, the vision encoder, if loaded from Hugging Face, is duplicated on every GPU whose pipeline parallelism rank is 0; if loaded from a .nemo checkpoint, the vision encoder is instead sharded across the tensor parallelism ranks. The rank layouts below illustrate both cases, and a configuration sketch follows them.

    Loading ViT from HF
    DP0
       PP rank 0
          TP rank 0 (if HF, ViT)
          TP rank 1 (if HF, ViT)
       PP rank 1
          TP rank 0
          TP rank 1
    DP1
       PP rank 0
          TP rank 0 (if HF, ViT)
          TP rank 1 (if HF, ViT)
       PP rank 1
          TP rank 0
          TP rank 1
    
    Loading ViT from .nemo
    DP0
       PP rank 0
          TP rank 0 (if NeMo, ViT TP rank 0)
          TP rank 1 (if NeMo, ViT TP rank 1)
       PP rank 1
          TP rank 0
          TP rank 1
    DP1
       PP rank 0
          TP rank 0 (if NeMo, ViT TP rank 0)
          TP rank 1 (if NeMo, ViT TP rank 1)
       PP rank 1
          TP rank 0
          TP rank 1
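
    As a minimal sketch of how these parallelism sizes might be set, the command below uses Hydra-style overrides with the NeVA pretraining script from the NeMo examples tree; the exact script path, the from_hf flag, and the 8-GPU layout are assumptions to adapt to your setup.

    # TP=2 x PP=2 across 8 GPUs leaves DP=2; the Hugging Face ViT is
    # replicated on the GPUs whose pipeline parallelism rank is 0.
    torchrun --nproc_per_node=8 examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
        trainer.devices=8 \
        model.tensor_model_parallel_size=2 \
        model.pipeline_model_parallel_size=2 \
        model.mm_cfg.vision_encoder.from_pretrained='openai/clip-vit-large-patch14' \
        model.mm_cfg.vision_encoder.from_hf=True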
    
  2. Mixtral Support with Expert Parallelism: NeVA now supports Mixtral as the foundation LLM, together with expert parallelism (a configuration sketch follows below).
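
    A hedged sketch of the corresponding overrides: the expert_model_parallel_size key follows the NeMo Megatron naming convention for expert parallelism, and the mm_cfg.llm.from_pretrained key plus the Mixtral checkpoint path are assumptions for illustration only.

    # Assumed layout: TP=4 with the MoE experts sharded across EP=2;
    # the .nemo path is a placeholder for a converted Mixtral checkpoint.
    torchrun --nproc_per_node=8 examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
        model.mm_cfg.llm.from_pretrained=/path/to/mixtral.nemo \
        model.tensor_model_parallel_size=4 \
        model.expert_model_parallel_size=2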

  3. SigLIP Encoder from Hugging Face: NeVA has been updated to support the SigLIP encoder sourced from Hugging Face. You can enable it by setting model.mm_cfg.vision_encoder.from_pretrained='google/siglip-so400m-patch14-384', as shown in the sketch below. Support for integrating the SigLIP module natively into the NeMo Multimodal framework is underway and will be available soon.
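
    For example, the override above can be paired with the from_hf flag (an assumption here, following the same vision_encoder config block) so the encoder weights are pulled from Hugging Face:

    # Swap the vision encoder for SigLIP, sourced from Hugging Face.
    torchrun --nproc_per_node=8 examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
        model.mm_cfg.vision_encoder.from_pretrained='google/siglip-so400m-patch14-384' \
        model.mm_cfg.vision_encoder.from_hf=True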

  4. Distributed Checkpoint Format Support: NeVA now supports a distributed checkpoint format, so you no longer need to manually repartition checkpoints when moving from pretraining to fine-tuning or from training to inference. Checkpoints can be loaded under any new TP/PP combination; for example, you can pretrain with TP=8 and PP=1, then fine-tune with TP=8 and PP=2 (see the sketch below). All newly created checkpoints are saved in the distributed format automatically, with no changes to the existing configuration.
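
    A sketch of the pretrain-then-fine-tune flow described above; the fine-tuning script name and the restore_from_path key are assumptions based on the NeMo examples tree, and note that no manual checkpoint repartitioning step appears in between.

    # Pretrain with TP=8, PP=1 on a single 8-GPU node.
    torchrun --nproc_per_node=8 examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
        model.tensor_model_parallel_size=8 \
        model.pipeline_model_parallel_size=1

    # Fine-tune the same checkpoint with TP=8, PP=2 on two nodes; the distributed
    # checkpoint format handles the new TP/PP combination automatically.
    torchrun --nproc_per_node=8 --nnodes=2 examples/multimodal/multimodal_llm/neva/neva_finetune.py \
        model.restore_from_path=/path/to/pretrained_neva.nemo \
        model.tensor_model_parallel_size=8 \
        model.pipeline_model_parallel_size=2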

  5. Sequence Packing: Refer to the NeVA Sequence Packing documentation.

  6. NeVA now supports a broader range of language models as the LLM backbone, including:

    • Llama 2 & 3

    • Mistral & Mixtral

    • Nemotron-4

Feature                  | Training                                                 | Inference
-------------------------|----------------------------------------------------------|----------------------------------------------------------
Data parallelism         | Yes                                                      | N/A
Tensor parallelism       | Yes                                                      | Yes
Pipeline parallelism     | Yes                                                      | No
Sequence parallelism     | Yes                                                      | No
Activation checkpointing | Yes                                                      | No
FP32/TF32                | Yes                                                      | Yes (FP16 enabled by default)
AMP/FP16                 | No                                                       | Yes
AMP/BF16                 | Yes                                                      | No
BF16 O2                  | Yes                                                      | No
TransformerEngine/FP8    | Yes                                                      | No
Multi-GPU                | Yes                                                      | Yes
Multi-Node               | Yes                                                      | Yes
Inference deployment     | N/A                                                      | NVIDIA Triton supported
SW stack support         | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform
NVfuser                  | No                                                       | N/A
Distributed Optimizer    | No                                                       | N/A
TorchInductor            | No                                                       | N/A
Flash Attention          | Yes                                                      | N/A