Support Matrix#

Hardware#

NVIDIA NIMs for visual language models (VLMs) should, but are not guaranteed to, run on any supported NVIDIA GPU. For further information, see the Supported Models section.

Software#

  • Linux operating systems (Ubuntu 20.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1
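A quick way to verify these prerequisites is to query the driver and Docker versions from the command line. The short Python sketch below does this via `subprocess`; it is an illustration, not an official check, and assumes `nvidia-smi` and `docker` are on the `PATH`.

```python
import subprocess

MIN_DRIVER = 535         # NVIDIA Driver >= 535
MIN_DOCKER = (23, 0, 1)  # NVIDIA Docker >= 23.0.1

def driver_version() -> int:
    # nvidia-smi can report the driver version directly, e.g. "535.154.05".
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return int(out.strip().splitlines()[0].split(".")[0])

def docker_version() -> tuple:
    # `docker --version` prints e.g. "Docker version 24.0.7, build afdd53b".
    out = subprocess.check_output(["docker", "--version"], text=True)
    ver = out.split()[2].rstrip(",")
    return tuple(int(p) for p in ver.split("."))

if __name__ == "__main__":
    assert driver_version() >= MIN_DRIVER, "NVIDIA driver is too old"
    assert docker_version() >= MIN_DOCKER, "Docker is too old"
    print("Driver and Docker meet the minimum versions.")
```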

Supported Models#

VILA NIM for VLMs models are optimized with mixed precision: TensorRT FP16 for vision encoding and TensorRT-LLM with AWQ quantization for LLM decoding. They are available as pre-built, optimized engines on NGC and should be used through the Chat Completions endpoint.
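For reference, a request to the Chat Completions endpoint of a locally deployed NIM might look like the following OpenAI-style call. This is a minimal sketch: the host, port, and model identifier are assumptions for illustration and depend on your deployment.

```python
import requests

# Assumed local NIM endpoint and model name; adjust to your deployment.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "vila",  # hypothetical model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/image.png"}},
            ],
        }
    ],
    "max_tokens": 256,
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```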

Optimized Configurations#

Disk Space covers both the container and the model, and Profile indicates the optimization target. Precision is mixed, with an FP16 vision encoder and an AWQ-quantized LLM decoder.

| GPU       | GPU Memory (GB) | Precision | Profile    | # of GPUs | Disk Space (GB) |
|-----------|-----------------|-----------|------------|-----------|-----------------|
| H100 SXM  | 80              | FP16+AWQ  | Throughput | 1         | 20              |
| H100 PCIe | 80              | FP16+AWQ  | Throughput | 1         | 20              |
| H100 NVL  | 94              | FP16+AWQ  | Throughput | 1         | 20              |
| A100 SXM  | 80              | FP16+AWQ  | Throughput | 1         | 20              |
| A100 PCIe | 80              | FP16+AWQ  | Throughput | 1         | 20              |
| L40S      | 46              | FP16+AWQ  | Throughput | 1         | 20              |
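To confirm that a host GPU appears in the table above before deploying, you can query the device name and total memory with `nvidia-smi`. The following Python sketch is illustrative; the supported-GPU names are transcribed from the table and matched by substring.

```python
import subprocess

# GPU families from the optimized-configurations table above.
SUPPORTED = ["H100", "A100", "L40S"]

def installed_gpus():
    # Ask nvidia-smi for each device's name and total memory (MiB).
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        text=True,
    )
    return [line.split(", ") for line in out.strip().splitlines()]

for name, memory in installed_gpus():
    ok = any(s in name for s in SUPPORTED)
    status = "optimized profile available" if ok else "not listed"
    print(f"{name} ({memory}): {status}")
```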

Non-optimized Configuration#

VILA NIM for VLMs does not currently support non-optimized configurations; attempting to deploy on a GPU that is not listed in the table above will fail.