Support Matrix#

Hardware#

Unless specified otherwise, NVIDIA NIM for vision language models (VLMs) should run on any NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, though this is not guaranteed. A CUDA compute capability of >= 7.0 (8.0 for bfloat16) is required. For more information, refer to Supported Models.
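
As a quick way to check whether a GPU meets these requirements, you can query the driver directly. A minimal sketch using nvidia-smi (the compute_cap query field requires a reasonably recent driver; on older drivers, look up the capability for your GPU model instead):

```bash
# Print each GPU's name, total memory, and CUDA compute capability.
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
```

A GPU must report a compute capability of at least 7.0 here (8.0 for bfloat16 profiles).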

NVIDIA NIM for VLMs does not support NVIDIA Virtual GPU (vGPU) environments.

For information on the supported operating systems, drivers, and software, refer to the Get Started page.

Supported Models#

Mistral Large 3 675B Instruct 2512#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.6.0

Hardware#

Mistral Large 3 675B Instruct 2512 is only supported on the following GPUs:

  • B200

  • H200 SXM

  • H200 NVL

Multiple GPUs of the same type are required (refer to the following table). Generic configurations aren’t supported.

The GPU Memory and Disk Space values are in GB.

| GPU | GPU Memory | Precision | # of GPUs | Disk Space |
|-----|------------|-----------|-----------|------------|
| B200 | 192 | FP8 | 8 | 638 |
| H200 SXM / NVL | 141 | FP8 | 8 | 638 |

Cosmos Reason2#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.6.0

Generic Configuration#

NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. It requires compute capability >= 7.0 (8.0 for bfloat16, 8.9 for FP8).

This model comes in two sizes: 2B and 8B.

The following configurations support Cosmos Reason2 8B. The GPU Memory and Disk Space values are in GB.

| GPU | GPU Memory | Precision | # of GPUs | Disk Space | Speculative Decoding Support |
|-----|------------|-----------|-----------|------------|------------------------------|
| Any | > 56 | BF16 | 1 or 2 | 30 | No |
| B300 | 288 | FP8 | 1 or 2 | 20 | No |
| B200 | 192 | FP8 | 1 or 2 | 20 | No |
| GB200 | 192 | FP8 | 1 or 2 | 20 | No |
| DGX Spark | 128 | FP8 | 1 | 20 | No |
| RTX PRO 6000 BWE | 96 | FP8 | 1 | 20 | No |
| RTX PRO 6000 BSE | 96 | FP8 | 1 or 2 | 20 | No |
| H100 SXM / PCIe | 80 | FP8 | 1 or 2 | 20 | Yes |
| H100 NVL | 94 | FP8 | 1 or 2 | 20 | Yes |
| H200 SXM / NVL | 141 | FP8 | 1 or 2 | 20 | Yes |
| GH200 | 141 | FP8 | 1 or 2 | 20 | Yes |
| H20 | 96 | FP8 | 1 or 2 | 20 | Yes |
| L40S | 48 | FP8 | 1 or 2 | 20 | Yes |

The following configurations support Cosmos Reason2 2B. The GPU Memory and Disk Space values are in GB.

| GPU | GPU Memory | Precision | # of GPUs | Disk Space | Speculative Decoding Support |
|-----|------------|-----------|-----------|------------|------------------------------|
| Any | > 36 | BF16 | 1 or 2 | 30 | No |
| B300 | 288 | FP8 | 1 or 2 | 20 | No |
| B200 | 192 | FP8 | 1 or 2 | 20 | No |
| GB200 | 192 | FP8 | 1 or 2 | 20 | No |
| DGX Spark | 128 | FP8 | 1 | 20 | No |
| RTX PRO 6000 BWE | 96 | FP8 | 1 | 20 | No |
| RTX PRO 6000 BSE | 96 | FP8 | 1 or 2 | 20 | No |
| H100 SXM / PCIe | 80 | FP8 | 1 or 2 | 20 | Yes |
| H100 NVL | 94 | FP8 | 1 or 2 | 20 | Yes |
| H200 SXM / NVL | 141 | FP8 | 1 or 2 | 20 | Yes |
| H20 | 96 | FP8 | 1 or 2 | 20 | Yes |
| GH200 | 141 | FP8 | 1 or 2 | 20 | Yes |
| L40S | 48 | FP8 | 1 or 2 | 20 | Yes |

Speculative Decoding#

This model supports speculative decoding acceleration, which enables faster token generation at lower concurrencies. You can use this feature by selecting profiles with the eagle tag in their name. Speculative decoding profiles are not selected by default.
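
As a sketch of how an eagle profile might be selected at launch (the image name and profile ID below are placeholders; list-model-profiles and NIM_MODEL_PROFILE follow the general NIM container conventions, so check the release documentation for exact values):

```bash
# List the profiles bundled with the container and look for
# entries with "eagle" in their name (placeholder image name).
docker run --rm --gpus all nvcr.io/nim/nvidia/cosmos-reason2-8b:latest \
  list-model-profiles

# Launch with a specific speculative decoding profile pinned
# (placeholder profile ID).
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE="<eagle-profile-id>" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/cosmos-reason2-8b:latest
```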

Supported codecs and video formats#

This model supports the following codecs and video formats for an input video:

  • Supported codecs: H264, H265, VP9, FLV

  • Supported video formats: MP4, MKV, FLV, 3GP

Note

This model does not support the H264 or H265 codecs with the MKV video format.
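
To verify an input video’s codec and container before sending it, an ffprobe check along these lines may help (this assumes ffmpeg is installed; the file name is a placeholder):

```bash
# Print the video codec (e.g., h264) and the container format
# (e.g., mov,mp4,m4a,3gp,3g2,mj2 for MP4) of input.mp4.
ffprobe -v error -select_streams v:0 \
  -show_entries stream=codec_name:format=format_name \
  -of default=noprint_wrappers=1 input.mp4
```

An H264 or H265 stream inside an MKV container, for example, would need to be remuxed into MP4 before being sent to this model.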

Nemotron Nano 12B v2 VL#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.6.0 (latest)

  • 1.5.0

Generic Configuration#

NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. It requires compute capability >= 7.0 (8.0 for bfloat16, 8.9 for FP8).

The GPU Memory and Disk Space values are in GB.

| GPU | GPU Memory | Precision | # of GPUs | Disk Space |
|-----|------------|-----------|-----------|------------|
| Any | > 48 | BF16 | 1 or 2 | 30 |
| B200 | 192 | NVFP4, FP8 | 1 or 2 | 20 |
| H100 SXM / PCIe | 80 | FP8 | 1 or 2 | 20 |
| H100 NVL | 94 | FP8 | 1 or 2 | 20 |
| H200 SXM / NVL | 141 | FP8 | 1 or 2 | 20 |
| GB200 | 192 | NVFP4, FP8 | 1 or 2 | 20 |
| GH200 | 141 | FP8 | 1 or 2 | 20 |
| L40S | 48 | FP8 | 1 or 2 | 20 |
| RTX PRO 6000 BWE | 96 | NVFP4, FP8 | 1 | 20 |
| RTX PRO 6000 BSE | 96 | NVFP4, FP8 | 1 or 2 | 20 |

Supported codecs and video formats#

This model supports the following codecs and video formats for an input video:

  • Supported codecs: H264, H265, VP8, VP9, FLV

  • Supported video formats: MP4, FLV, 3GP

Video size constraints#

The recommended maximum file size is 1GB per request.

Higher concurrency increases the risk of out-of-memory (OOM) errors. If an OOM error is returned at high concurrency, reduce the video input size further or lower the concurrency level.
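
One way to stay under the recommended limit is to check the file size and, if needed, re-encode the video at a lower quality or resolution. A minimal sketch with ffmpeg (file names and the CRF value are placeholders to tune):

```bash
# Print the file size in bytes.
stat -c %s input.mp4

# Re-encode with a higher CRF (smaller file) and a 720p height cap;
# audio is copied unchanged.
ffmpeg -i input.mp4 -c:v libx264 -crf 28 -vf "scale=-2:720" -c:a copy output.mp4
```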

Nemotron Parse#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.5.0

Optimized Configurations#

NVIDIA recommends at least 30GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H100 SXM | 80 | BF16 | Throughput | 1 |
| A100 SXM | 80 | BF16 | Throughput | 1 |
| L40S | 48 | BF16 | Throughput | 1 |
| A10G | 24 | BF16 | Throughput | 1 |

Cosmos Reason1 7B#

Latest Supported Release Version#

This model is supported in the following VLM release versions:

  • 1.4.1 (latest)

  • 1.4.0

Generic Configuration#

NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. It requires compute capability >= 7.0 (8.0 for bfloat16, 8.9 for FP8).

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 24 | BF16 | 16 |
| 16 | FP8 | 10 |

Supported codecs and video formats#

This model supports the following codecs and video formats for an input video:

  • Supported codecs: H264, H265, VP9, FLV

  • Supported video formats: MP4, MKV, FLV, 3GP

Note

This model does not support the H264 or H265 codecs with the MKV video format.

Llama 4 Maverick 17B 128E Instruct#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.4.0

Overview#

The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.

Hardware#

Llama 4 Maverick 17B 128E Instruct is only supported on a node with eight GPUs of one of the following types:

  • H100 SXM

  • H100 NVL

  • H200 SXM

  • H200 NVL

Generic configurations aren’t supported.

A context length of 1,000,000 is supported on H200 nodes only. For an H100 (SXM or NVL) node, a context length of 430,000 is supported.

Llama 4 Maverick 17B 128E Instruct uses Meta’s official FP8 checkpoints only.

Mistral Small 3.2 24B Instruct 2506#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.3.1

Generic Configuration#

NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. It requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 68 | BF16 | 50 |

Llama 4 Scout 17B 16E Instruct#

Latest Supported Release Version#

This model is supported in the following VLM release versions:

  • 1.3.2 (latest)

  • 1.3.1

  • 1.3.0

Overview#

The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.

Non-optimized Configuration#

NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. Llama-4 is a mixture-of-experts (MoE) model with 109 billion total parameters, of which 17 billion are active.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16) and < 10.

Important

The NIM supports a maximum context length of 128K (131,072 tokens) for Llama-4.

Important

Llama-4 is a mixture-of-experts (MoE) based model, with a total of 109 billion parameters and 17 billion active parameters. The GPU memory required is based on the model’s total number of parameters (109B) and the ability to support a sequence of full context length.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 250 | BF16 | 240 |
| 250 | FP8 (dynamic) | 240 |

Important

For the FP8 profile, the same amount of memory as BF16 is required because FP8 quantization happens on the fly, which means the BF16 weights must first be loaded into memory. The model fits on a 4x H100 SXM setup.

Important

If there is not enough memory for the KV cache of a full-length sequence, try reducing the model’s context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32,768) when launching NIM.
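
For illustration, a launch sketch that caps the context length via NIM_MAX_MODEL_LEN (the image name is a placeholder):

```bash
# Cap the context length at 32,768 tokens to shrink the KV cache
# (placeholder image name).
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MAX_MODEL_LEN=32768 \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-4-scout-17b-16e-instruct:latest
```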

Llama 3.1 Nemotron Nano VL 8B v1#

Latest Supported Release Version#

This model is supported in the following VLM release versions:

  • 1.3.1 (latest)

  • 1.3.0

Generic Configuration#

Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. Requires compute capability >= 7.0 (8.0 for bfloat16) and < 10, with at least one GPU having 95% or greater free memory.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 24 | FP8 | 32 |
| 24 | BF16 | 40 |

Supported TRT-LLM Buildable Profiles#

  • Precision: FP8, BF16

  • # of GPUs: 1

nemoretriever-parse#

nemoretriever-parse is a tiny autoregressive visual language model (VLM) designed for document transcription from images. You supply an input image, and nemoretriever-parse outputs its text in reading order, along with information about the document structure. nemoretriever-parse leverages C-RADIO (Commercial RADIO) for visual feature extraction and mBART as the decoder for generating text outputs.

Important

This model takes requests with a single image, and images larger than 2048x1648px are scaled down.

Important

This model doesn’t support text input.
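
As a sketch of a single-image transcription request (this assumes the OpenAI-compatible chat completions endpoint that NIM microservices generally expose; the host, model name, and file name are placeholders, and the exact request schema for this model may differ, so consult its API reference):

```bash
# Base64-encode a page image and request a transcription.
IMAGE_B64=$(base64 -w0 page.png)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemoretriever-parse",
    "messages": [{
      "role": "user",
      "content": [{"type": "image_url",
                   "image_url": {"url": "data:image/png;base64,'"$IMAGE_B64"'"}}]
    }]
  }'
```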

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.2.0

Documentation for this model is not available in the current VLM release; refer to the documentation for version 1.2.0.

Optimized Configurations#

NVIDIA recommends at least 30GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H100 SXM | 80 | BF16 | Throughput | 1 |
| A100 SXM | 80 | BF16 | Throughput | 1 |
| L40S | 48 | BF16 | Throughput | 1 |

Local Build Optimized Configurations#

For GPU configurations not listed above, NIM for VLMs offers support through the local build configuration. Any NVIDIA GPU with sufficient memory should be able to build and run this model (though this isn’t guaranteed).

A local build starts automatically if no suitable GPU configuration is found.

Note

Requires a GPU with compute capability >= 8.0 and < 10.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 10 | BF16 | 30 |

Llama-3.2-11B-Vision-Instruct#

Overview#

The Meta Llama 3.2 Vision collection consists of pre-trained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image, and they outperform many of the available open-source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.

The 11B model is recommended for users who want to prioritize response speed and have a moderate compute budget.

Important

This model only takes requests with a single image, and images larger than 1120x1120px are scaled down.

Important

This model does not support tool use.
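
Because oversized images are scaled down server-side anyway, shrinking them client-side first can save upload bandwidth. A minimal sketch with ImageMagick (file names are placeholders; the trailing > in the geometry shrinks only images that exceed the bounds):

```bash
# Resize input.jpg so neither dimension exceeds 1120 px, preserving
# aspect ratio; smaller images pass through untouched.
convert input.jpg -resize '1120x1120>' output.jpg
```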

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.1.1

Documentation for this model is not available in the current VLM release; refer to the documentation for version 1.1.1.

Optimized Configurations#

NVIDIA recommends at least 50GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H200 SXM | 141 | BF16 | Latency | 2 |
| H200 SXM | 141 | FP8 | Latency | 2 |
| H200 SXM | 141 | BF16 | Throughput | 1 |
| H200 SXM | 141 | FP8 | Throughput | 1 |
| H100 SXM | 80 | BF16 | Latency | 2 |
| H100 SXM | 80 | FP8 | Latency | 2 |
| H100 SXM | 80 | BF16 | Throughput | 1 |
| H100 SXM | 80 | FP8 | Throughput | 1 |
| A100 SXM | 80 | BF16 | Latency | 2 |
| A100 SXM | 80 | BF16 | Throughput | 1 |
| H100 PCIe | 80 | BF16 | Latency | 2 |
| H100 PCIe | 80 | FP8 | Latency | 2 |
| H100 PCIe | 80 | BF16 | Throughput | 1 |
| H100 PCIe | 80 | FP8 | Throughput | 1 |
| A100 PCIe | 80 | BF16 | Latency | 2 |
| A100 PCIe | 80 | BF16 | Throughput | 1 |
| L40S | 48 | BF16 | Latency | 4 |
| L40S | 48 | BF16 | Throughput | 2 |
| A10G | 24 | BF16 | Latency | 8 |
| A10G | 24 | BF16 | Throughput | 4 |

Non-optimized Configuration#

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16) and < 10.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 60 | BF16 | 50 |

Important

If there is not enough memory for the KV cache of a full-length sequence, try reducing the model’s context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32,768) when launching NIM.

Llama-3.2-90B-Vision-Instruct#

Overview#

The Meta Llama 3.2 Vision collection consists of pre-trained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image, and they outperform many of the available open-source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.

The 90B model is recommended for users who want to prioritize model accuracy and have a high compute budget.

Important

This model only takes requests with a single image, and images larger than 1120x1120px are scaled down.

Important

This model does not support tool use.

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.1.1

Documentation for this model is not available in the current VLM release; refer to the documentation for version 1.1.1.

Optimized Configurations#

NVIDIA recommends at least 200GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H200 SXM | 141 | BF16 | Latency | 4 |
| H200 SXM | 141 | FP8 | Latency | 2 |
| H200 SXM | 141 | BF16 | Throughput | 2 |
| H200 SXM | 141 | FP8 | Throughput | 1 |
| H100 SXM | 80 | BF16 | Latency | 8 |
| H100 SXM | 80 | FP8 | Latency | 4 |
| H100 SXM | 80 | BF16 | Throughput | 4 |
| H100 SXM | 80 | FP8 | Throughput | 2 |
| A100 SXM | 80 | BF16 | Latency | 8 |
| A100 SXM | 80 | BF16 | Throughput | 4 |
| H100 PCIe | 80 | BF16 | Latency | 8 |
| H100 PCIe | 80 | FP8 | Latency | 4 |
| H100 PCIe | 80 | BF16 | Throughput | 4 |
| H100 PCIe | 80 | FP8 | Throughput | 2 |
| A100 PCIe | 80 | BF16 | Latency | 8 |
| A100 PCIe | 80 | BF16 | Throughput | 4 |
| L40S | 48 | BF16 | Throughput | 8 |

Non-optimized Configuration#

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16) and < 10.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 240 | BF16 | 200 |

Important

If there is not enough memory for the KV cache of a full-length sequence, try reducing the model’s context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32,768) when launching NIM.