Support Matrix#
Hardware#
NVIDIA NIM for visual language models (VLMs) should, but is not guaranteed to, run on any supported NVIDIA GPU. For details, see the Supported Models section below.
Software#
- Linux operating systems (Ubuntu 20.04 or later recommended)
- NVIDIA Driver >= 535
- NVIDIA Docker >= 23.0.1
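As a quick sanity check, the sketch below compares the local driver and Docker versions against these minimums. It assumes `nvidia-smi` and `docker` are on the PATH and is an illustration, not an official installer check.

```python
import re
import subprocess

# Minimum versions from the software requirements above.
MIN_DRIVER_MAJOR = 535
MIN_DOCKER = (23, 0, 1)

# First GPU's driver version, e.g. "535.104.05".
driver = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).splitlines()[0].strip()
if int(driver.split(".")[0]) < MIN_DRIVER_MAJOR:
    raise SystemExit(f"NVIDIA driver {driver} < required {MIN_DRIVER_MAJOR}")

# Docker reports e.g. "Docker version 24.0.7, build afdd53b".
docker_version = subprocess.check_output(["docker", "--version"], text=True)
found = re.search(r"(\d+)\.(\d+)\.(\d+)", docker_version)
if found is None or tuple(map(int, found.groups())) < MIN_DOCKER:
    raise SystemExit(f"Docker version check failed: {docker_version.strip()}")

print("Driver and Docker meet the minimum requirements.")
```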
Supported Models#
VILA NIM for VLMs models are optimized with mixed precision: TensorRT FP16 for vision encoding and TensorRT-LLM with AWQ quantization for LLM decoding. They are available as pre-built, optimized engines on NGC and should be used with the Chat Completions Endpoint.
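Because the engines are served through the Chat Completions Endpoint, a request looks like a standard OpenAI-style chat completion. A minimal sketch, assuming a NIM container listening on `localhost:8000` and a hypothetical model identifier `nvidia/vila`; substitute the model name your deployment reports:

```python
import requests

# Assumed local endpoint and hypothetical model name; adjust for your deployment.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nvidia/vila"  # placeholder; use the model id your NIM exposes

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 128,
}

response = requests.post(URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```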
Optimized Configurations#
GPU Memory and Disk Space values are in GB. Disk Space accounts for both the container and the model. Profile indicates what the model is optimized for. Precision is mixed: an FP16 vision encoder and an AWQ-quantized LLM decoder.
| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|---|---|---|---|---|---|
| H100 SXM | 80 | FP16+AWQ | Throughput | 1 | 20 |
| H100 PCIe | 80 | FP16+AWQ | Throughput | 1 | 20 |
| H100 NVL | 94 | FP16+AWQ | Throughput | 1 | 20 |
| A100 SXM | 80 | FP16+AWQ | Throughput | 1 | 20 |
| A100 PCIe | 80 | FP16+AWQ | Throughput | 1 | 20 |
| L40S | 48 | FP16+AWQ | Throughput | 1 | 20 |
Non-optimized Configuration#
VILA NIM for VLMs does not currently support non-optimized configurations; attempting to deploy on a GPU not listed in the previous section will fail.
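Before deploying, you can check whether the local GPU matches one of the optimized configurations. A minimal sketch, assuming `nvidia-smi` is available; the name substrings are taken from the table above and are not an official compatibility API:

```python
import subprocess

# GPU families from the optimized-configurations table above.
OPTIMIZED = ("H100", "A100", "L40S")

rows = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    text=True,
).strip().splitlines()

for row in rows:
    name, memory = (field.strip() for field in row.split(",", 1))
    if any(tag in name for tag in OPTIMIZED):
        print(f"{name} ({memory}): matches an optimized configuration")
    else:
        print(f"{name} ({memory}): not listed; deployment will fail")
```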