Support Matrix#

Hardware#

NVIDIA NIMs for visual language models (VLMs) should, but are not guaranteed to, run on any supported NVIDIA GPU. For further information, see the Supported Models section.

Software#

  • Linux operating systems (Ubuntu 20.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1
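A quick way to verify these prerequisites is to query the driver and Docker versions from the command line. The short Python sketch below does this via `subprocess`; it is an illustration, not an official check, and assumes `nvidia-smi` and `docker` are on the `PATH`.

```python
import subprocess

MIN_DRIVER = 535         # NVIDIA Driver >= 535
MIN_DOCKER = (23, 0, 1)  # NVIDIA Docker >= 23.0.1

def driver_version() -> int:
    # nvidia-smi can report the driver version directly, e.g. "535.154.05".
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return int(out.strip().splitlines()[0].split(".")[0])

def docker_version() -> tuple:
    # `docker --version` prints e.g. "Docker version 24.0.7, build afdd53b".
    out = subprocess.check_output(["docker", "--version"], text=True)
    ver = out.split()[2].rstrip(",")
    return tuple(int(p) for p in ver.split("."))

if __name__ == "__main__":
    assert driver_version() >= MIN_DRIVER, "NVIDIA driver is too old"
    assert docker_version() >= MIN_DOCKER, "Docker is too old"
    print("Driver and Docker meet the minimum versions.")
```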

Supported Models#

VILA NIM for VLMs models are optimized with mixed precision: TensorRT FP16 for vision encoding and TensorRT-LLM with AWQ quantization for LLM decoding. They are available as pre-built, optimized engines on NGC and should be used through the Chat Completions endpoint.
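For reference, a request to the Chat Completions endpoint of a locally deployed NIM might look like the following OpenAI-style call. This is a minimal sketch: the host, port, and model identifier are assumptions for illustration and depend on your deployment.

```python
import requests

# Assumed local NIM endpoint and model name; adjust to your deployment.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "vila",  # hypothetical model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/image.png"}},
            ],
        }
    ],
    "max_tokens": 256,
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```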

Optimized Configurations#

Disk Space covers both the container and the model, and Profile indicates the optimization target. Precision is mixed, with an FP16 vision encoder and an AWQ-quantized LLM decoder.

| GPU       | GPU Memory (GB) | Precision | Profile    | # of GPUs | Disk Space (GB) |
|-----------|-----------------|-----------|------------|-----------|-----------------|
| H100 SXM  | 80              | FP16+AWQ  | Throughput | 1         | 20              |
| H100 PCIe | 80              | FP16+AWQ  | Throughput | 1         | 20              |
| H100 NVL  | 94              | FP16+AWQ  | Throughput | 1         | 20              |
| A100 SXM  | 80              | FP16+AWQ  | Throughput | 1         | 20              |
| A100 PCIe | 80              | FP16+AWQ  | Throughput | 1         | 20              |
| L40S      | 46              | FP16+AWQ  | Throughput | 1         | 20              |
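To confirm that a host GPU appears in the table above before deploying, you can query the device name and total memory with `nvidia-smi`. The following Python sketch is illustrative; the supported-GPU names are transcribed from the table and matched by substring.

```python
import subprocess

# GPU families from the optimized-configurations table above.
SUPPORTED = ["H100", "A100", "L40S"]

def installed_gpus():
    # Ask nvidia-smi for each device's name and total memory (MiB).
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        text=True,
    )
    return [line.split(", ") for line in out.strip().splitlines()]

for name, memory in installed_gpus():
    ok = any(s in name for s in SUPPORTED)
    status = "optimized profile available" if ok else "not listed"
    print(f"{name} ({memory}): {status}")
```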

Non-optimized Configuration#

VILA NIM for VLMs does not currently support non-optimized configurations; attempting to deploy on a GPU that is not listed in the table above will fail.