NeVA (LLaVA)
NeVA, which originates from LLaVA (Large Language and Vision Assistant), is a vision-language model in the NeMo Multimodal ecosystem. It combines a large language model (such as NVGPT or Llama2) with a vision encoder and is trained on machine-generated multimodal language-image instruction-following data. Building on LLaVA, NeVA improves training by using features of the NeMo LLM framework such as model parallelism, activation checkpointing, AMP O2, and Flash Attention. Where traditional language models process only text, NeVA covers both visual and linguistic comprehension.
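To make the pairing of a vision encoder with a language model more concrete, the sketch below shows one common way to join the two, assuming a LLaVA-style linear projection that maps image patch features into the language model's embedding space before they are spliced into the text token sequence. This page does not describe NeVA's actual connector, so the function names, toy dimensions, and random values are all hypothetical and purely illustrative; nothing here is NeMo API.

```python
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def build_multimodal_sequence(image_features, projection, text_embeddings, image_position):
    """Project image features to the LLM width and splice them into the text sequence.

    Hypothetical helper for illustration only; not part of NeMo or LLaVA.
    """
    image_tokens = matmul(image_features, projection)
    return text_embeddings[:image_position] + image_tokens + text_embeddings[image_position:]


random.seed(0)
VISION_DIM, LLM_DIM = 4, 6            # toy sizes, chosen for readability
NUM_PATCHES, NUM_TEXT_TOKENS = 3, 5

# Stand-ins for vision encoder output, learned projection weights, and text embeddings.
image_features = [[random.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
projection = [[random.random() for _ in range(LLM_DIM)] for _ in range(VISION_DIM)]
text_embeddings = [[random.random() for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

# Insert the projected image tokens after the second text token.
sequence = build_multimodal_sequence(image_features, projection, text_embeddings, image_position=2)
print(len(sequence), len(sequence[0]))  # prints "8 6"
```

The combined sequence has one row per text token plus one per image patch, all at the language model's embedding width, which is what lets the language model attend over both modalities in a single pass.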
| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | Yes | No |
| Sequence parallelism | Yes | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | Yes | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | |
| SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | No | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |
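
When planning a run, it can help to check a list of intended features against the support table above. The following is a minimal sketch of such a check: the `SUPPORT` dictionary transcribes a subset of the table's rows, and `unsupported` is a hypothetical helper written for this example, not a NeMo function.

```python
# Subset of the feature support table, transcribed by hand.
SUPPORT = {
    "Data parallelism": {"training": "Yes", "inference": "N/A"},
    "Tensor parallelism": {"training": "Yes", "inference": "Yes"},
    "Pipeline parallelism": {"training": "Yes", "inference": "No"},
    "Sequence parallelism": {"training": "Yes", "inference": "No"},
    "Activation checkpointing": {"training": "Yes (Uniform or Block)", "inference": "No"},
    "AMP/FP16": {"training": "No", "inference": "Yes"},
    "AMP/BF16": {"training": "Yes", "inference": "No"},
    "Flash Attention": {"training": "Yes", "inference": "N/A"},
}


def unsupported(features, phase):
    """Return the features whose table entry for `phase` does not start with 'Yes'."""
    return [name for name in features if not SUPPORT[name][phase].startswith("Yes")]


planned = ["Tensor parallelism", "Pipeline parallelism", "AMP/BF16"]
print(unsupported(planned, "training"))   # prints []
print(unsupported(planned, "inference"))  # prints ['Pipeline parallelism', 'AMP/BF16']
```

Entries marked N/A are reported alongside those marked No, since in both cases the feature cannot simply be switched on for that phase.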