# Support Matrix

## Hardware

NVIDIA NIMs for vision language models (VLMs) should run on any NVIDIA GPU with sufficient memory, although this is not guaranteed. They can also run on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and a CUDA compute capability >= 7.0 (8.0 for bfloat16). See the Supported Models section below for further information.
## Software

- Linux operating systems (Ubuntu 20.04 or later recommended)
- NVIDIA Driver >= 535
- NVIDIA Docker >= 23.0.1
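As a quick sanity check, the driver requirement can be verified programmatically. A minimal sketch, assuming the nvidia-ml-py (pynvml) package is installed; running `nvidia-smi` reports the same information:

```python
# Sketch: confirm the NVIDIA driver meets the >= 535 requirement.
# Assumes nvidia-ml-py (pip install nvidia-ml-py); `nvidia-smi` shows the same.
import pynvml

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()
if isinstance(driver, bytes):  # older pynvml releases return bytes
    driver = driver.decode()
pynvml.nvmlShutdown()

major = int(driver.split(".")[0])
print(f"Driver {driver}: {'OK' if major >= 535 else 'too old, need >= 535'}")
```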
## Supported Models

The following models are optimized with TensorRT-LLM (TRT-LLM) and are available as pre-built, optimized engines on NGC. Use them through the Chat Completions endpoint.
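This endpoint follows the OpenAI API schema. A minimal request sketch, assuming a NIM running locally on the default port 8000; the model id shown is a placeholder, so query `/v1/models` for the one your deployment serves:

```python
# Sketch: calling a locally deployed NIM through its OpenAI-compatible
# Chat Completions endpoint. Base URL and model id are assumptions;
# substitute the values for your deployment.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta/llama-3.2-11b-vision-instruct",  # placeholder model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }],
        "max_tokens": 256,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```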
### nemoretriever-parse

nemoretriever-parse is a compact autoregressive vision language model (VLM) designed for document transcription from images. You supply an input image, and nemoretriever-parse outputs its text in reading order along with information about the document structure. nemoretriever-parse leverages C-RADIO (Commercial RADIO) for visual feature extraction and mBART as the decoder for generating text output.
**Important:** This model takes requests with a single image; images larger than 2048x1648 px are scaled down.

**Important:** This model does not support text input.
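Because nemoretriever-parse accepts exactly one image and no text, a request carries only an image part in the message content. A minimal sketch, assuming a local deployment on port 8000 and a placeholder model id; the exact request schema for your deployment may differ:

```python
# Sketch: sending a single document image to nemoretriever-parse.
# The message content holds only an image part, since the model takes
# no text input. Endpoint and model id are assumptions.
import base64
import requests

with open("page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "nvidia/nemoretriever-parse",  # placeholder model id
        "messages": [{
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            }],
        }],
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```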
#### Optimized Configurations

NVIDIA recommends at least 30 GB of disk space for the container and model.

GPU Memory values are in GB; the Profile column indicates what the model is optimized for.
| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H100 SXM | 80 | BF16 | Throughput | 1 |
| A100 SXM | 80 | BF16 | Throughput | 1 |
| L40S | 48 | BF16 | Throughput | 1 |
#### Local Build Optimized Configurations

For GPU configurations not listed above, NIM for VLMs offers support through a local build. Any NVIDIA GPU with sufficient memory should be able to build and run this model, although this is not guaranteed. A local build starts automatically when no matching pre-built configuration is found.
**Note:** Requires a GPU with compute capability >= 8.0.

The GPU Memory and Disk Space values are in GB.
| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 10 | BF16 | 30 |
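To check the compute capability >= 8.0 requirement noted above before attempting a local build, something like the following works. PyTorch is an assumption here; `nvidia-smi --query-gpu=compute_cap --format=csv` reports the same value:

```python
# Sketch: verify each visible GPU meets compute capability >= 8.0,
# which the local build requires. Assumes PyTorch is installed.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    ok = (major, minor) >= (8, 0)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} "
          f"(compute capability {major}.{minor}) -> "
          f"{'OK' if ok else 'unsupported'}")
```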
### Llama-3.2-11B-Vision-Instruct

#### Overview

Meta's Llama 3.2 Vision is a collection of pre-trained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image, and they outperform many of the available open-source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.
The 11B model is recommended for users who want to prioritize response speed and have a moderate compute budget.
**Important:** This model only takes requests with a single image; images larger than 1120x1120 px are scaled down.

**Important:** This model does not support tool use.
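Since images above 1120x1120 px are scaled down server-side anyway, resizing on the client first can save upload bandwidth. A minimal sketch using Pillow, which is an assumption here; any image library works:

```python
# Sketch: pre-scale an image on the client so it stays within the
# 1120x1120 px limit and the server does not have to resize it.
# Assumes Pillow (pip install Pillow).
from PIL import Image

img = Image.open("photo.jpg")
img.thumbnail((1120, 1120))  # shrinks in place, preserving aspect ratio
img.save("photo_resized.jpg")
print(img.size)
```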
#### Optimized Configurations

NVIDIA recommends at least 50 GB of disk space for the container and model.

GPU Memory values are in GB; the Profile column indicates what the model is optimized for.
| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H200 SXM | 141 | BF16 | Latency | 2 |
| H200 SXM | 141 | FP8 | Latency | 2 |
| H200 SXM | 141 | BF16 | Throughput | 1 |
| H200 SXM | 141 | FP8 | Throughput | 1 |
| H100 SXM | 80 | BF16 | Latency | 2 |
| H100 SXM | 80 | FP8 | Latency | 2 |
| H100 SXM | 80 | BF16 | Throughput | 1 |
| H100 SXM | 80 | FP8 | Throughput | 1 |
| A100 SXM | 80 | BF16 | Latency | 2 |
| A100 SXM | 80 | BF16 | Throughput | 1 |
| H100 PCIe | 80 | BF16 | Latency | 2 |
| H100 PCIe | 80 | FP8 | Latency | 2 |
| H100 PCIe | 80 | BF16 | Throughput | 1 |
| H100 PCIe | 80 | FP8 | Throughput | 1 |
| A100 PCIe | 80 | BF16 | Latency | 2 |
| A100 PCIe | 80 | BF16 | Throughput | 1 |
| L40S | 48 | BF16 | Latency | 4 |
| L40S | 48 | BF16 | Throughput | 2 |
| A10G | 24 | BF16 | Latency | 8 |
| A10G | 24 | BF16 | Throughput | 4 |
#### Non-optimized Configuration

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.
**Important:** Requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.
| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 60 | BF16 | 50 |
**Important:** If there is not enough memory left for the KV cache at the full context length, try reducing the model's context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32768) when launching NIM.
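To see whether a host clears the 60 GB aggregate-memory figure in the table above, the per-GPU memory can be summed. A minimal sketch, again assuming nvidia-ml-py:

```python
# Sketch: sum memory across all visible GPUs and compare against the
# 60 GB the non-optimized BF16 configuration needs. Assumes nvidia-ml-py.
import pynvml

pynvml.nvmlInit()
total_gb = sum(
    pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i)).total
    for i in range(pynvml.nvmlDeviceGetCount())
) / 1024**3
pynvml.nvmlShutdown()

print(f"Aggregate GPU memory: {total_gb:.0f} GB")
if total_gb < 60:
    print("Below 60 GB: consider lowering NIM_MAX_MODEL_LEN or adding GPUs.")
```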
### Llama-3.2-90B-Vision-Instruct

#### Overview

Meta's Llama 3.2 Vision is a collection of pre-trained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image, and they outperform many of the available open-source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.
The 90B model is recommended for users who want to prioritize model accuracy and have a high compute budget.
**Important:** This model only takes requests with a single image; images larger than 1120x1120 px are scaled down.

**Important:** This model does not support tool use.
#### Optimized Configurations

NVIDIA recommends at least 200 GB of disk space for the container and model.

GPU Memory values are in GB; the Profile column indicates what the model is optimized for.
| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H200 SXM | 141 | BF16 | Latency | 4 |
| H200 SXM | 141 | FP8 | Latency | 2 |
| H200 SXM | 141 | BF16 | Throughput | 2 |
| H200 SXM | 141 | FP8 | Throughput | 1 |
| H100 SXM | 80 | BF16 | Latency | 8 |
| H100 SXM | 80 | FP8 | Latency | 4 |
| H100 SXM | 80 | BF16 | Throughput | 4 |
| H100 SXM | 80 | FP8 | Throughput | 2 |
| A100 SXM | 80 | BF16 | Latency | 8 |
| A100 SXM | 80 | BF16 | Throughput | 4 |
| H100 PCIe | 80 | BF16 | Latency | 8 |
| H100 PCIe | 80 | FP8 | Latency | 4 |
| H100 PCIe | 80 | BF16 | Throughput | 4 |
| H100 PCIe | 80 | FP8 | Throughput | 2 |
| A100 PCIe | 80 | BF16 | Latency | 8 |
| A100 PCIe | 80 | BF16 | Throughput | 4 |
| L40S | 48 | BF16 | Throughput | 8 |
#### Non-optimized Configuration

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.
**Important:** Requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.
| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 240 | BF16 | 200 |
**Important:** If there is not enough memory left for the KV cache at the full context length, try reducing the model's context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32768) when launching NIM.