Support Matrix#

Hardware#

NVIDIA NIMs for vision language models (VLMs) should, but are not guaranteed to, run on any NVIDIA GPU, provided the GPU has sufficient memory. They can also run on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and a CUDA compute capability of at least 7.0 (8.0 for bfloat16). See the following Supported Models section for further information.
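The capability rule above can be sketched as a small check. The numeric thresholds (7.0 in general, 8.0 for bfloat16) come from this document; the function itself is illustrative and not part of NIM.

```python
# Illustrative check of the compute-capability requirement described above.
# Thresholds are from this document; the helper is not part of NIM.

def meets_capability(compute_capability: float, use_bfloat16: bool = False) -> bool:
    """Return True if a GPU's CUDA compute capability satisfies the requirement."""
    required = 8.0 if use_bfloat16 else 7.0
    return compute_capability >= required

# A100-class GPUs (8.0) satisfy both requirements; V100-class GPUs (7.0)
# satisfy the general requirement but not the bfloat16 one.
print(meets_capability(8.0, use_bfloat16=True))   # True
print(meets_capability(7.0, use_bfloat16=True))   # False
```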

Software#

  • Linux operating systems (Ubuntu 20.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1

Supported Models#

The following models are optimized using TRT-LLM, are available as pre-built, optimized engines on NGC, and should be used through the Chat Completions endpoint.
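A request to the Chat Completions endpoint pairs a text prompt with an image in an OpenAI-compatible message body. The sketch below builds such a payload; the model id, endpoint path, and image URL are illustrative assumptions to be replaced with the values for your deployed NIM.

```python
# Sketch of an OpenAI-style Chat Completions request body for a NIM VLM.
# Model id, endpoint path, and URL below are illustrative assumptions.
import json

def build_image_request(model: str, prompt: str, image_url: str) -> dict:
    """Build a single-image chat completion payload (OpenAI-compatible schema)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 256,
    }

payload = build_image_request(
    model="meta/llama-3.2-11b-vision-instruct",   # example model id
    prompt="Describe this image.",
    image_url="https://example.com/sample.png",    # placeholder URL
)
# The body would be POSTed to, e.g., http://<host>:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```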

nemoretriever-parse#

nemoretriever-parse is a tiny autoregressive Visual Language Model (VLM) designed for document transcription from images. You supply an input image, and nemoretriever-parse outputs its text in reading order along with information about the document structure. nemoretriever-parse leverages Commercial RADIO (C-RADIO) for visual feature extraction and mBART as the decoder for generating text outputs.

Important

This model only takes requests with a single image, and images larger than 2048x1648px are scaled down.

Important

This model doesn’t support text input.
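The size limit above can be previewed client-side. The helper below computes the post-scaling dimensions for an oversized image, assuming the aspect ratio is preserved during downscaling (the document does not state the exact resampling behavior); it is illustrative and not part of NIM.

```python
# Illustrative preview of the downscaling described above: images larger
# than 2048x1648 px are scaled down. Aspect-ratio preservation is an
# assumption; this helper is not part of NIM.

def scaled_size(width: int, height: int, max_w: int = 2048, max_h: int = 1648) -> tuple[int, int]:
    """Return the image size after fitting within max_w x max_h, keeping aspect ratio."""
    if width <= max_w and height <= max_h:
        return width, height
    scale = min(max_w / width, max_h / height)
    return int(width * scale), int(height * scale)

print(scaled_size(1200, 800))    # (1200, 800) -- already within limits
print(scaled_size(4096, 1648))   # (2048, 824) -- width-limited
```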

Optimized Configurations#

NVIDIA recommends at least 30GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU      | GPU Memory | Precision | Profile    | # of GPUs |
|----------|------------|-----------|------------|-----------|
| H100 SXM | 80         | BF16      | Throughput | 1         |
| A100 SXM | 80         | BF16      | Throughput | 1         |
| L40S     | 48         | BF16      | Throughput | 1         |

Local Build Optimized Configurations#

For GPU configurations not listed above, NIM for VLMs offers support through the local build configuration. Any NVIDIA GPU with sufficient memory should be able to build and run this model (though this isn’t guaranteed).

A local build starts automatically if no suitable GPU configuration is found.

Note

Requires a GPU with compute capability >= 8.0.

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 10         | BF16      | 30         |

Llama-3.2-11B-Vision-Instruct#

Overview#

The Meta Llama 3.2 Vision collection comprises pre-trained and instruction-tuned multimodal image-reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2 Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open-source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.

The 11B model is recommended for users who want to prioritize response speed and have a moderate compute budget.

Important

This model only takes requests with a single image, and images larger than 1120x1120px are scaled down.

Important

This model does not support tool use.
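The two constraints above (a single image per request, no tool use) can be expressed as a pre-flight check on the request body. The validator below is hypothetical, written against the OpenAI-compatible message schema; NIM enforces these limits server-side.

```python
# Hypothetical pre-flight check reflecting the constraints above:
# at most one image per request, and no tool definitions.
# NIM enforces these server-side; this helper only illustrates them.

def check_request(payload: dict) -> list[str]:
    """Return a list of constraint violations for a Llama 3.2 Vision request."""
    problems = []
    if "tools" in payload:
        problems.append("tool use is not supported by this model")
    image_parts = [
        part
        for message in payload.get("messages", [])
        for part in (message.get("content") or [])
        if isinstance(part, dict) and part.get("type") == "image_url"
    ]
    if len(image_parts) > 1:
        problems.append("only a single image per request is supported")
    return problems

ok = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}},
            ],
        }
    ]
}
print(check_request(ok))  # []
```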

Optimized Configurations#

NVIDIA recommends at least 50GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU       | GPU Memory | Precision | Profile    | # of GPUs |
|-----------|------------|-----------|------------|-----------|
| H200 SXM  | 141        | BF16      | Latency    | 2         |
| H200 SXM  | 141        | FP8       | Latency    | 2         |
| H200 SXM  | 141        | BF16      | Throughput | 1         |
| H200 SXM  | 141        | FP8       | Throughput | 1         |
| H100 SXM  | 80         | BF16      | Latency    | 2         |
| H100 SXM  | 80         | FP8       | Latency    | 2         |
| H100 SXM  | 80         | BF16      | Throughput | 1         |
| H100 SXM  | 80         | FP8       | Throughput | 1         |
| A100 SXM  | 80         | BF16      | Latency    | 2         |
| A100 SXM  | 80         | BF16      | Throughput | 1         |
| H100 PCIe | 80         | BF16      | Latency    | 2         |
| H100 PCIe | 80         | FP8       | Latency    | 2         |
| H100 PCIe | 80         | BF16      | Throughput | 1         |
| H100 PCIe | 80         | FP8       | Throughput | 1         |
| A100 PCIe | 80         | BF16      | Latency    | 2         |
| A100 PCIe | 80         | BF16      | Throughput | 1         |
| L40S      | 48         | BF16      | Latency    | 4         |
| L40S      | 48         | BF16      | Throughput | 2         |
| A10G      | 24         | BF16      | Latency    | 8         |
| A10G      | 24         | BF16      | Throughput | 4         |

Non-optimized Configuration#

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 60         | BF16      | 50         |

Important

If there is not enough GPU memory for the KV cache at the full context length, try reducing the model's context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32768) when launching NIM.
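A launch with a reduced context length might look like the following. The container image name, port, and other flags are illustrative; substitute the values for your deployment.

```shell
# Example NIM launch with a reduced context length.
# Image name, port, and flags are illustrative; adjust for your deployment.
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MAX_MODEL_LEN=32768 \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.2-11b-vision-instruct:latest
```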

Llama-3.2-90B-Vision-Instruct#

Overview#

The Meta Llama 3.2 Vision collection comprises pre-trained and instruction-tuned multimodal image-reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2 Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open-source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.

The 90B model is recommended for users who want to prioritize model accuracy and have a high compute budget.

Important

This model only takes requests with a single image, and images larger than 1120x1120px are scaled down.

Important

This model does not support tool use.

Optimized Configurations#

NVIDIA recommends at least 200GB disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU       | GPU Memory | Precision | Profile    | # of GPUs |
|-----------|------------|-----------|------------|-----------|
| H200 SXM  | 141        | BF16      | Latency    | 4         |
| H200 SXM  | 141        | FP8       | Latency    | 2         |
| H200 SXM  | 141        | BF16      | Throughput | 2         |
| H200 SXM  | 141        | FP8       | Throughput | 1         |
| H100 SXM  | 80         | BF16      | Latency    | 8         |
| H100 SXM  | 80         | FP8       | Latency    | 4         |
| H100 SXM  | 80         | BF16      | Throughput | 4         |
| H100 SXM  | 80         | FP8       | Throughput | 2         |
| A100 SXM  | 80         | BF16      | Latency    | 8         |
| A100 SXM  | 80         | BF16      | Throughput | 4         |
| H100 PCIe | 80         | BF16      | Latency    | 8         |
| H100 PCIe | 80         | FP8       | Latency    | 4         |
| H100 PCIe | 80         | BF16      | Throughput | 4         |
| H100 PCIe | 80         | FP8       | Throughput | 2         |
| A100 PCIe | 80         | BF16      | Latency    | 8         |
| A100 PCIe | 80         | BF16      | Throughput | 4         |
| L40S      | 48         | BF16      | Throughput | 8         |

Non-optimized Configuration#

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 240        | BF16      | 200        |

Important

If there is not enough GPU memory for the KV cache at the full context length, try reducing the model's context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32768) when launching NIM.