Support Matrix

Hardware

NVIDIA NIMs for vision language models (VLMs) should, but are not guaranteed to, run on any NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, provided the GPUs have a CUDA compute capability of at least 7.0 (8.0 for bfloat16). See the Supported Models section below for further information.
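As a quick pre-deployment check, the sketch below reports each visible GPU's compute capability and memory against these thresholds. It assumes PyTorch is installed; nvidia-smi or pynvml would work just as well.

```python
# Sketch: verify visible GPUs against the compute capability and memory
# requirements above. Assumes PyTorch; any CUDA query tool works equally well.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected.")

total_mem_gb = 0.0
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    mem_gb = props.total_memory / 1024**3
    total_mem_gb += mem_gb
    print(f"GPU {i}: {props.name}, compute capability "
          f"{props.major}.{props.minor}, {mem_gb:.0f} GB")
    if (props.major, props.minor) < (7, 0):
        print("  -> below the minimum compute capability of 7.0")
    elif (props.major, props.minor) < (8, 0):
        print("  -> meets the 7.0 minimum, but bfloat16 requires 8.0")

print(f"Aggregate GPU memory: {total_mem_gb:.0f} GB")
```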

Software

  • Linux operating systems (Ubuntu 20.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1

Supported Models

The following models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC. They should be used through the Chat Completions endpoint.

Llama-3.2-11B-Vision-Instruct

Overview

The Meta Llama 3.2 Vision collection of multimodal large language models (LLMs) comprises pre-trained and instruction-tuned image-reasoning generative models in 11B and 90B sizes (text + images in, text out). The Llama 3.2 Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. They outperform many of the available open source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.

The 11B model is recommended for users who want to prioritize response speed and have a moderate compute budget.

Important

This model accepts only a single image per request.

Important

This model does not support tool use.
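Given these constraints, a single-image request through the OpenAI-compatible Chat Completions endpoint looks like the sketch below. The base URL, port, and model identifier are assumptions about a typical local deployment; check them against your own.

```python
# Sketch: single-image chat completion against a locally running NIM.
# Base URL, port, and model name are assumptions; adjust to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Exactly one image per request; this model rejects multiple images.
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```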

Optimized Configurations

NVIDIA recommends at least 50 GB of free disk space for the container and model.
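For a quick check that enough space is available, here is a minimal sketch; the cache location is an assumption (host setups commonly mount a local directory such as ~/.cache/nim), so adjust the path to wherever your deployment stores models.

```python
# Sketch: check that at least 50 GB is free for the container image and model.
# The cache path is an assumption; adjust to your NIM cache location.
import shutil
from pathlib import Path

cache = Path.home() / ".cache" / "nim"
target = cache if cache.exists() else Path.home()
free_gb = shutil.disk_usage(target).free / 1024**3
status = "OK" if free_gb >= 50 else "below the 50 GB recommendation"
print(f"Free space at {target}: {free_gb:.0f} GB ({status})")
```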

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU       | GPU Memory | Precision | Profile    | # of GPUs |
|-----------|------------|-----------|------------|-----------|
| H200 SXM  | 141        | BF16      | Latency    | 2         |
| H200 SXM  | 141        | FP8       | Latency    | 2         |
| H200 SXM  | 141        | BF16      | Throughput | 1         |
| H200 SXM  | 141        | FP8       | Throughput | 1         |
| H100 SXM  | 80         | BF16      | Latency    | 2         |
| H100 SXM  | 80         | FP8       | Latency    | 2         |
| H100 SXM  | 80         | BF16      | Throughput | 1         |
| H100 SXM  | 80         | FP8       | Throughput | 1         |
| A100 SXM  | 80         | BF16      | Latency    | 2         |
| A100 SXM  | 80         | BF16      | Throughput | 1         |
| H100 PCIe | 80         | BF16      | Latency    | 2         |
| H100 PCIe | 80         | FP8       | Latency    | 2         |
| H100 PCIe | 80         | BF16      | Throughput | 1         |
| H100 PCIe | 80         | FP8       | Throughput | 1         |
| A100 PCIe | 80         | BF16      | Latency    | 2         |
| A100 PCIe | 80         | BF16      | Throughput | 1         |
| L40S      | 48         | BF16      | Latency    | 4         |
| L40S      | 48         | BF16      | Throughput | 2         |
| A10G      | 24         | BF16      | Latency    | 8         |
| A10G      | 24         | BF16      | Throughput | 4         |

Non-optimized Configuration

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 60         | BF16      | 50         |

Important

If there is not enough memory for the KV cache at the full context length, try reducing the model’s context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32768) when launching NIM.
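For illustration, here is one way to apply this setting at launch using the Docker SDK for Python; the image tag, cache mount, and port are assumptions for the sketch, and passing -e NIM_MAX_MODEL_LEN=32768 to docker run achieves the same thing.

```python
# Sketch: launch the NIM container with a reduced context length so the
# KV cache fits. Image tag, paths, and port are illustrative assumptions.
import os
import docker

client = docker.from_env()
client.containers.run(
    "nvcr.io/nim/meta/llama-3.2-11b-vision-instruct:latest",  # hypothetical tag
    environment={
        "NIM_MAX_MODEL_LEN": "32768",              # cap the context length
        "NGC_API_KEY": os.environ["NGC_API_KEY"],  # assumes the key is exported
    },
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    volumes={os.path.expanduser("~/.cache/nim"): {"bind": "/opt/nim/.cache",
                                                  "mode": "rw"}},
    ports={"8000/tcp": 8000},
    detach=True,
)
```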

Llama-3.2-90B-Vision-Instruct

Overview

The Meta Llama 3.2 Vision collection of multimodal large language models (LLMs) comprises pre-trained and instruction-tuned image-reasoning generative models in 11B and 90B sizes (text + images in, text out). The Llama 3.2 Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. They outperform many of the available open source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.

The 90B model is recommended for users who want to prioritize model accuracy and have a high compute budget.

Important

This model accepts only a single image per request.

Important

This model does not support tool use.

Optimized Configurations

NVIDIA recommends at least 200 GB of free disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU       | GPU Memory | Precision | Profile    | # of GPUs |
|-----------|------------|-----------|------------|-----------|
| H200 SXM  | 141        | BF16      | Latency    | 4         |
| H200 SXM  | 141        | FP8       | Latency    | 2         |
| H200 SXM  | 141        | BF16      | Throughput | 2         |
| H200 SXM  | 141        | FP8       | Throughput | 1         |
| H100 SXM  | 80         | BF16      | Latency    | 8         |
| H100 SXM  | 80         | FP8       | Latency    | 4         |
| H100 SXM  | 80         | BF16      | Throughput | 4         |
| H100 SXM  | 80         | FP8       | Throughput | 2         |
| A100 SXM  | 80         | BF16      | Latency    | 8         |
| A100 SXM  | 80         | BF16      | Throughput | 4         |
| H100 PCIe | 80         | BF16      | Latency    | 8         |
| H100 PCIe | 80         | FP8       | Latency    | 4         |
| H100 PCIe | 80         | BF16      | Throughput | 4         |
| H100 PCIe | 80         | FP8       | Throughput | 2         |
| A100 PCIe | 80         | BF16      | Latency    | 8         |
| A100 PCIe | 80         | BF16      | Throughput | 4         |
| L40S      | 48         | BF16      | Throughput | 8         |

Non-optimized Configuration

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 240        | BF16      | 200        |

Important

If there is not enough memory for the KV cache at the full context length, try reducing the model’s context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32768) when launching NIM.