NeVA (LLaVA)

Originating from LLaVA (Large Language and Vision Assistant), NeVA is a groundbreaking addition to the NeMo Multimodal ecosystem. The model connects a large language model (such as NVGPT or Llama 2) to a vision encoder and is trained on machine-generated, multimodal language-image instruction-following data. Building on the foundation set by LLaVA, NeVA further enhances training by leveraging NeMo LLM framework features such as model parallelism, activation checkpointing, AMP O2, and Flash Attention. Where traditional language models process text alone, NeVA bridges visual and linguistic comprehension in a single model.
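The LLaVA-style wiring behind NeVA is straightforward: a vision encoder produces image features, a trainable projection maps them into the LLM's embedding space, and the projected image tokens are concatenated with the text token embeddings before the language model runs. The PyTorch sketch below illustrates this data flow; the class and parameter names are illustrative placeholders, not NeMo's actual modules.

```python
import torch
import torch.nn as nn

class NevaStyleModel(nn.Module):
    """Minimal sketch of the LLaVA/NeVA wiring. Every component here is
    an illustrative placeholder, not one of NeMo's classes."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a CLIP ViT, kept frozen
        self.projection = nn.Linear(vision_dim, llm_dim)  # maps image features into LLM space
        self.llm = llm                                    # e.g. NVGPT or Llama 2

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                             # the vision encoder is not trained
            image_feats = self.vision_encoder(images)     # (B, num_patches, vision_dim)
        image_tokens = self.projection(image_feats)       # (B, num_patches, llm_dim)
        # Concatenate projected image tokens with the text embeddings so the
        # LLM attends over both modalities as a single sequence.
        return self.llm(torch.cat([image_tokens, text_embeds], dim=1))
```

As in LLaVA's recipe, a first stage typically trains only the projection to align the two feature spaces; the language model is then also fine-tuned on instruction-following data.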

| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (uniform or block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | NVIDIA Triton supported |
| SW stack support | Slurm DeepOps / Base Command Manager / Base Command Platform | Slurm DeepOps / Base Command Manager / Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | No | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |
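In practice, the training-side features above are toggled through the model's Hydra/OmegaConf configuration. The fragment below sketches what a config consistent with the table might look like; the key names follow common NeMo Megatron conventions but are assumptions here, so consult the shipped NeVA config for the authoritative schema.

```python
from omegaconf import OmegaConf

# Hypothetical training-config fragment mirroring the feature table above.
# Key names are assumptions based on common NeMo Megatron conventions.
cfg = OmegaConf.create({
    "trainer": {
        "devices": 8,         # Multi-GPU: supported
        "num_nodes": 2,       # Multi-Node: supported
        "precision": "bf16",  # AMP/BF16 for training (FP16 is inference-only)
    },
    "model": {
        "tensor_model_parallel_size": 4,    # tensor parallelism: supported
        "pipeline_model_parallel_size": 1,  # pipeline parallelism: unsupported, keep at 1
        "megatron_amp_O2": True,            # BF16 O2
        "activations_checkpoint_granularity": "full",
        "activations_checkpoint_method": "uniform",  # "uniform" or "block"
        "use_flash_attention": True,        # Flash Attention: supported for training
    },
})
print(OmegaConf.to_yaml(cfg))
```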