Support Matrix#

Hardware#

Unless specified otherwise, NVIDIA NIM for vision language models (VLMs) should run on any NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, though this is not guaranteed. A CUDA compute capability of >= 7.0 (8.0 for bfloat16) is required. For more information, refer to Supported Models.
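
As a quick way to check whether a GPU meets these requirements, you can query the driver directly. A minimal sketch using nvidia-smi (the compute_cap query field requires a reasonably recent driver; on older drivers, look up the capability for your GPU model instead):

```bash
# Print each GPU's name, total memory, and CUDA compute capability.
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
```

A GPU must report a compute capability of at least 7.0 here (8.0 for bfloat16 profiles).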

NVIDIA NIM for VLMs does not support NVIDIA Virtual GPU (vGPU) environments.

For information on the supported operating systems, drivers, and software, refer to the Get Started page.

Supported Models#

Mistral Large 3 675B Instruct 2512#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.6.0

Hardware#

Mistral Large 3 675B Instruct 2512 is only supported on the following GPUs:

  • B200

  • H200 SXM

  • H200 NVL

Multiple GPUs of the same type are required (refer to the following table). Generic configurations aren’t supported.

The GPU Memory and Disk Space values are in GB.

| GPU | GPU Memory | Precision | # of GPUs | Disk Space |
|-----|------------|-----------|-----------|------------|
| B200 | 192 | FP8 | 8 | 638 |
| H200 SXM / NVL | 141 | FP8 | 8 | 638 |

Cosmos Reason2#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.6.0

Generic Configuration#

NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. It requires compute capability >= 7.0 (8.0 for bfloat16, 8.9 for FP8).

This model comes in two sizes: 2B and 8B.

The following configurations support Cosmos Reason2 8B. The GPU Memory and Disk Space values are in GB.

| GPU | GPU Memory | Precision | # of GPUs | Disk Space | Speculative Decoding Support |
|-----|------------|-----------|-----------|------------|------------------------------|
| Any | > 56 | BF16 | 1 or 2 | 30 | No |
| B300 | 288 | FP8 | 1 or 2 | 20 | No |
| B200 | 192 | FP8 | 1 or 2 | 20 | No |
| GB200 | 192 | FP8 | 1 or 2 | 20 | No |
| DGX Spark | 128 | FP8 | 1 | 20 | No |
| RTX PRO 6000 BWE | 96 | FP8 | 1 | 20 | No |
| RTX PRO 6000 BSE | 96 | FP8 | 1 or 2 | 20 | No |
| H100 SXM / PCIe | 80 | FP8 | 1 or 2 | 20 | Yes |
| H100 NVL | 94 | FP8 | 1 or 2 | 20 | Yes |
| H200 SXM / NVL | 141 | FP8 | 1 or 2 | 20 | Yes |
| GH200 | 141 | FP8 | 1 or 2 | 20 | Yes |
| H20 | 96 | FP8 | 1 or 2 | 20 | Yes |
| L40S | 48 | FP8 | 1 or 2 | 20 | Yes |

The following configurations support Cosmos Reason2 2B. The GPU Memory and Disk Space values are in GB.

| GPU | GPU Memory | Precision | # of GPUs | Disk Space | Speculative Decoding Support |
|-----|------------|-----------|-----------|------------|------------------------------|
| Any | > 36 | BF16 | 1 or 2 | 30 | No |
| B300 | 288 | FP8 | 1 or 2 | 20 | No |
| B200 | 192 | FP8 | 1 or 2 | 20 | No |
| GB200 | 192 | FP8 | 1 or 2 | 20 | No |
| DGX Spark | 128 | FP8 | 1 | 20 | No |
| RTX PRO 6000 BWE | 96 | FP8 | 1 | 20 | No |
| RTX PRO 6000 BSE | 96 | FP8 | 1 or 2 | 20 | No |
| H100 SXM / PCIe | 80 | FP8 | 1 or 2 | 20 | Yes |
| H100 NVL | 94 | FP8 | 1 or 2 | 20 | Yes |
| H200 SXM / NVL | 141 | FP8 | 1 or 2 | 20 | Yes |
| H20 | 96 | FP8 | 1 or 2 | 20 | Yes |
| GH200 | 141 | FP8 | 1 or 2 | 20 | Yes |
| L40S | 48 | FP8 | 1 or 2 | 20 | Yes |

Speculative Decoding#

This model supports speculative decoding acceleration, which enables faster token generation at lower concurrencies. You can use this feature by selecting profiles with the eagle tag in their name. Speculative decoding profiles are not selected by default.
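
As a sketch of how an eagle profile might be selected at launch (the image name and profile ID below are placeholders; list-model-profiles and NIM_MODEL_PROFILE follow the general NIM container conventions, so check the release documentation for exact values):

```bash
# List the profiles bundled with the container and look for
# entries with "eagle" in their name (placeholder image name).
docker run --rm --gpus all nvcr.io/nim/nvidia/cosmos-reason2-8b:latest \
  list-model-profiles

# Launch with a specific speculative decoding profile pinned
# (placeholder profile ID).
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE="<eagle-profile-id>" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/cosmos-reason2-8b:latest
```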

Supported codecs and video formats#

This model supports the following codecs and video formats for an input video:

  • Supported codecs: H264, H265, VP9, FLV

  • Supported video formats: MP4, MKV, FLV, 3GP

Note

This model does not support the H264 or H265 codecs with the MKV video format.
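
To verify an input video’s codec and container before sending it, an ffprobe check along these lines may help (this assumes ffmpeg is installed; the file name is a placeholder):

```bash
# Print the video codec (e.g., h264) and the container format
# (e.g., mov,mp4,m4a,3gp,3g2,mj2 for MP4) of input.mp4.
ffprobe -v error -select_streams v:0 \
  -show_entries stream=codec_name:format=format_name \
  -of default=noprint_wrappers=1 input.mp4
```

An H264 or H265 stream inside an MKV container, for example, would need to be remuxed into MP4 before being sent to this model.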

Nemotron Nano 12B v2 VL#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.6.0 (latest)

  • 1.5.0

Generic Configuration#

NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. It requires compute capability >= 7.0 (8.0 for bfloat16, 8.9 for FP8).

The GPU Memory and Disk Space values are in GB.

| GPU | GPU Memory | Precision | # of GPUs | Disk Space |
|-----|------------|-----------|-----------|------------|
| Any | > 48 | BF16 | 1 or 2 | 30 |
| B200 | 192 | NVFP4, FP8 | 1 or 2 | 20 |
| H100 SXM / PCIe | 80 | FP8 | 1 or 2 | 20 |
| H100 NVL | 94 | FP8 | 1 or 2 | 20 |
| H200 SXM / NVL | 141 | FP8 | 1 or 2 | 20 |
| GB200 | 192 | NVFP4, FP8 | 1 or 2 | 20 |
| GH200 | 141 | FP8 | 1 or 2 | 20 |
| L40S | 48 | FP8 | 1 or 2 | 20 |
| RTX PRO 6000 BWE | 96 | NVFP4, FP8 | 1 | 20 |
| RTX PRO 6000 BSE | 96 | NVFP4, FP8 | 1 or 2 | 20 |

Supported codecs and video formats#

This model supports the following codecs and video formats for an input video:

  • Supported codecs: H264, H265, VP8, VP9, FLV

  • Supported video formats: MP4, FLV, 3GP

Video size constraints#

The recommended maximum file size is 1GB per request.

Higher concurrency increases the risk of out-of-memory (OOM) errors. If an OOM error is returned at high concurrency, reduce the video input size further or lower the concurrency level.
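
One way to stay under the recommended limit is to check the file size and, if needed, re-encode the video at a lower quality or resolution. A minimal sketch with ffmpeg (file names and the CRF value are placeholders to tune):

```bash
# Print the file size in bytes.
stat -c %s input.mp4

# Re-encode with a higher CRF (smaller file) and a 720p height cap;
# audio is copied unchanged.
ffmpeg -i input.mp4 -c:v libx264 -crf 28 -vf "scale=-2:720" -c:a copy output.mp4
```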

Nemotron Parse#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.5.0

Optimized Configurations#

NVIDIA recommends at least 30GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H100 SXM | 80 | BF16 | Throughput | 1 |
| A100 SXM | 80 | BF16 | Throughput | 1 |
| L40S | 48 | BF16 | Throughput | 1 |
| A10G | 24 | BF16 | Throughput | 1 |

Cosmos Reason1 7B#

Latest Supported Release Version#

This model is supported in the following VLM release versions:

  • 1.4.1 (latest)

  • 1.4.0

Generic Configuration#

NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. It requires compute capability >= 7.0 (8.0 for bfloat16, 8.9 for FP8).

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 24 | BF16 | 16 |
| 16 | FP8 | 10 |

Supported codecs and video formats#

This model supports the following codecs and video formats for an input video:

  • Supported codecs: H264, H265, VP9, FLV

  • Supported video formats: MP4, MKV, FLV, 3GP

Note

This model does not support the H264 or H265 codecs with the MKV video format.

Llama 4 Maverick 17B 128E Instruct#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.4.0

Overview#

The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.

Hardware#

Llama 4 Maverick 17B 128E Instruct is only supported on a node with eight GPUs of one of the following types:

  • H100 SXM

  • H100 NVL

  • H200 SXM

  • H200 NVL

Generic configurations aren’t supported.

A context length of 1,000,000 is supported on H200 nodes only. For an H100 (SXM or NVL) node, a context length of 430,000 is supported.

Llama 4 Maverick 17B 128E Instruct uses Meta’s official FP8 checkpoints only.

Mistral Small 3.2 24B Instruct 2506#

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.3.1

Generic Configuration#

NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. It requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 68 | BF16 | 50 |

Llama 4 Scout 17B 16E Instruct#

Latest Supported Release Version#

This model is supported in the following VLM release versions:

  • 1.3.2 (latest)

  • 1.3.1

  • 1.3.0

Overview#

The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.

Non-optimized Configuration#

NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. Llama-4 is a mixture-of-experts (MoE) model with 109 billion total parameters, of which 17 billion are active.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16) and < 10.

Important

The NIM supports a maximum context length of 128K (131,072 tokens) for Llama-4.

Important

Llama-4 is a mixture-of-experts (MoE) based model, with a total of 109 billion parameters and 17 billion active parameters. The GPU memory required is based on the model’s total number of parameters (109B) and the ability to support a sequence of full context length.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 250 | BF16 | 240 |
| 250 | FP8 (dynamic) | 240 |

Important

For the FP8 profile, the same amount of memory as BF16 is required because FP8 quantization happens on the fly, which means the BF16 weights must first be loaded into memory. The model fits on a 4x H100 SXM setup.

Important

If there is not enough memory for the KV cache of a full-length sequence, try reducing the model’s context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32,768) when launching NIM.
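
For illustration, a launch sketch that caps the context length via NIM_MAX_MODEL_LEN (the image name is a placeholder):

```bash
# Cap the context length at 32,768 tokens to shrink the KV cache
# (placeholder image name).
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MAX_MODEL_LEN=32768 \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-4-scout-17b-16e-instruct:latest
```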

Llama 3.1 Nemotron Nano VL 8B v1#

Latest Supported Release Version#

This model is supported in the following VLM release versions:

  • 1.3.1 (latest)

  • 1.3.0

Generic Configuration#

Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. Requires compute capability >= 7.0 (8.0 for bfloat16) and < 10, with at least one GPU having 95% or greater free memory.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 24 | FP8 | 32 |
| 24 | BF16 | 40 |

Supported TRT-LLM Buildable Profiles#

  • Precision: FP8, BF16

  • # of GPUs: 1

nemoretriever-parse#

nemoretriever-parse is a tiny autoregressive visual language model (VLM) designed for document transcription from images. You supply an input image, and nemoretriever-parse outputs its text in reading order, along with information about the document structure. nemoretriever-parse leverages C-RADIO (Commercial RADIO) for visual feature extraction and mBART as the decoder for generating text outputs.

Important

This model takes requests with a single image, and images larger than 2048x1648px are scaled down.

Important

This model doesn’t support text input.
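
As a sketch of a single-image transcription request (this assumes the OpenAI-compatible chat completions endpoint that NIM microservices generally expose; the host, model name, and file name are placeholders, and the exact request schema for this model may differ, so consult its API reference):

```bash
# Base64-encode a page image and request a transcription.
IMAGE_B64=$(base64 -w0 page.png)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemoretriever-parse",
    "messages": [{
      "role": "user",
      "content": [{"type": "image_url",
                   "image_url": {"url": "data:image/png;base64,'"$IMAGE_B64"'"}}]
    }]
  }'
```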

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.2.0

Documentation for this model is not available in the current VLM release; refer to the documentation for version 1.2.0.

Optimized Configurations#

NVIDIA recommends at least 30GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H100 SXM | 80 | BF16 | Throughput | 1 |
| A100 SXM | 80 | BF16 | Throughput | 1 |
| L40S | 48 | BF16 | Throughput | 1 |

Local Build Optimized Configurations#

For GPU configurations not listed above, NIM for VLMs offers support through the local build configuration. Any NVIDIA GPU with sufficient memory should be able to build and run this model (though this isn’t guaranteed).

A local build starts automatically if no suitable GPU configuration is found.

Note

Requires a GPU with compute capability >= 8.0 and < 10.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 10 | BF16 | 30 |

Llama-3.2-11B-Vision-Instruct#

Overview#

The Meta Llama 3.2 Vision collection consists of pre-trained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image, and they outperform many of the available open-source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.

The 11B model is recommended for users who want to prioritize response speed and have a moderate compute budget.

Important

This model only takes requests with a single image, and images larger than 1120x1120px are scaled down.

Important

This model does not support tool use.
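
Because oversized images are scaled down server-side anyway, shrinking them client-side first can save upload bandwidth. A minimal sketch with ImageMagick (file names are placeholders; the trailing > in the geometry shrinks only images that exceed the bounds):

```bash
# Resize input.jpg so neither dimension exceeds 1120 px, preserving
# aspect ratio; smaller images pass through untouched.
convert input.jpg -resize '1120x1120>' output.jpg
```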

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.1.1

Documentation for this model is not available in the current VLM release; refer to the documentation for version 1.1.1.

Optimized Configurations#

NVIDIA recommends at least 50GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H200 SXM | 141 | BF16 | Latency | 2 |
| H200 SXM | 141 | FP8 | Latency | 2 |
| H200 SXM | 141 | BF16 | Throughput | 1 |
| H200 SXM | 141 | FP8 | Throughput | 1 |
| H100 SXM | 80 | BF16 | Latency | 2 |
| H100 SXM | 80 | FP8 | Latency | 2 |
| H100 SXM | 80 | BF16 | Throughput | 1 |
| H100 SXM | 80 | FP8 | Throughput | 1 |
| A100 SXM | 80 | BF16 | Latency | 2 |
| A100 SXM | 80 | BF16 | Throughput | 1 |
| H100 PCIe | 80 | BF16 | Latency | 2 |
| H100 PCIe | 80 | FP8 | Latency | 2 |
| H100 PCIe | 80 | BF16 | Throughput | 1 |
| H100 PCIe | 80 | FP8 | Throughput | 1 |
| A100 PCIe | 80 | BF16 | Latency | 2 |
| A100 PCIe | 80 | BF16 | Throughput | 1 |
| L40S | 48 | BF16 | Latency | 4 |
| L40S | 48 | BF16 | Throughput | 2 |
| A10G | 24 | BF16 | Latency | 8 |
| A10G | 24 | BF16 | Throughput | 4 |

Non-optimized Configuration#

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16) and < 10.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 60 | BF16 | 50 |

Important

If there is not enough memory for the KV cache of a full-length sequence, try reducing the model’s context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32,768) when launching NIM.

Llama-3.2-90B-Vision-Instruct#

Overview#

The Meta Llama 3.2 Vision collection consists of pre-trained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image, and they outperform many of the available open-source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.

The 90B model is recommended for users who want to prioritize model accuracy and have a high compute budget.

Important

This model only takes requests with a single image, and images larger than 1120x1120px are scaled down.

Important

This model does not support tool use.

Latest Supported Release Version#

This model is supported in the following VLM release version:

  • 1.1.1

Documentation for this model is not available in the current VLM release; refer to the documentation for version 1.1.1.

Optimized Configurations#

NVIDIA recommends at least 200GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H200 SXM | 141 | BF16 | Latency | 4 |
| H200 SXM | 141 | FP8 | Latency | 2 |
| H200 SXM | 141 | BF16 | Throughput | 2 |
| H200 SXM | 141 | FP8 | Throughput | 1 |
| H100 SXM | 80 | BF16 | Latency | 8 |
| H100 SXM | 80 | FP8 | Latency | 4 |
| H100 SXM | 80 | BF16 | Throughput | 4 |
| H100 SXM | 80 | FP8 | Throughput | 2 |
| A100 SXM | 80 | BF16 | Latency | 8 |
| A100 SXM | 80 | BF16 | Throughput | 4 |
| H100 PCIe | 80 | BF16 | Latency | 8 |
| H100 PCIe | 80 | FP8 | Latency | 4 |
| H100 PCIe | 80 | BF16 | Throughput | 4 |
| H100 PCIe | 80 | FP8 | Throughput | 2 |
| A100 PCIe | 80 | BF16 | Latency | 8 |
| A100 PCIe | 80 | BF16 | Throughput | 4 |
| L40S | 48 | BF16 | Throughput | 8 |

Non-optimized Configuration#

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16) and < 10.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 240 | BF16 | 200 |

Important

If there is not enough memory for the KV cache of a full-length sequence, try reducing the model’s context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32,768) when launching NIM.