Support Matrix

Hardware

NVIDIA NIMs for vision language models (VLMs) should, but are not guaranteed to, run on any NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, provided the GPUs have a CUDA compute capability of at least 7.0 (8.0 for bfloat16). See the Supported Models section below for further information.
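As a quick pre-deployment check, the sketch below reports each visible GPU's compute capability and memory against these thresholds. It assumes PyTorch is installed; nvidia-smi or pynvml would work just as well.

```python
# Sketch: verify visible GPUs against the compute capability and memory
# requirements above. Assumes PyTorch; any CUDA query tool works equally well.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected.")

total_mem_gb = 0.0
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    mem_gb = props.total_memory / 1024**3
    total_mem_gb += mem_gb
    print(f"GPU {i}: {props.name}, compute capability "
          f"{props.major}.{props.minor}, {mem_gb:.0f} GB")
    if (props.major, props.minor) < (7, 0):
        print("  -> below the minimum compute capability of 7.0")
    elif (props.major, props.minor) < (8, 0):
        print("  -> meets the 7.0 minimum, but bfloat16 requires 8.0")

print(f"Aggregate GPU memory: {total_mem_gb:.0f} GB")
```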

Software

  • Linux operating systems (Ubuntu 20.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1

Supported Models

The following models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC. They should be used through the Chat Completions endpoint.

Llama-3.2-11B-Vision-Instruct

Overview

The Meta Llama 3.2 Vision collection of multimodal large language models (LLMs) comprises pre-trained and instruction-tuned image-reasoning generative models in 11B and 90B sizes (text + images in, text out). The Llama 3.2 Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. They outperform many of the available open source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.

The 11B model is recommended for users who want to prioritize response speed and have a moderate compute budget.

Important

This model accepts only a single image per request.

Important

This model does not support tool use.
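Given these constraints, a single-image request through the OpenAI-compatible Chat Completions endpoint looks like the sketch below. The base URL, port, and model identifier are assumptions about a typical local deployment; check them against your own.

```python
# Sketch: single-image chat completion against a locally running NIM.
# Base URL, port, and model name are assumptions; adjust to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama-3.2-11b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Exactly one image per request; this model rejects multiple images.
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```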

Optimized Configurations

NVIDIA recommends at least 50 GB of free disk space for the container and model.
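For a quick check that enough space is available, here is a minimal sketch; the cache location is an assumption (host setups commonly mount a local directory such as ~/.cache/nim), so adjust the path to wherever your deployment stores models.

```python
# Sketch: check that at least 50 GB is free for the container image and model.
# The cache path is an assumption; adjust to your NIM cache location.
import shutil
from pathlib import Path

cache = Path.home() / ".cache" / "nim"
target = cache if cache.exists() else Path.home()
free_gb = shutil.disk_usage(target).free / 1024**3
status = "OK" if free_gb >= 50 else "below the 50 GB recommendation"
print(f"Free space at {target}: {free_gb:.0f} GB ({status})")
```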

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU       | GPU Memory | Precision | Profile    | # of GPUs |
|-----------|------------|-----------|------------|-----------|
| H200 SXM  | 141        | BF16      | Latency    | 2         |
| H200 SXM  | 141        | FP8       | Latency    | 2         |
| H200 SXM  | 141        | BF16      | Throughput | 1         |
| H200 SXM  | 141        | FP8       | Throughput | 1         |
| H100 SXM  | 80         | BF16      | Latency    | 2         |
| H100 SXM  | 80         | FP8       | Latency    | 2         |
| H100 SXM  | 80         | BF16      | Throughput | 1         |
| H100 SXM  | 80         | FP8       | Throughput | 1         |
| A100 SXM  | 80         | BF16      | Latency    | 2         |
| A100 SXM  | 80         | BF16      | Throughput | 1         |
| H100 PCIe | 80         | BF16      | Latency    | 2         |
| H100 PCIe | 80         | FP8       | Latency    | 2         |
| H100 PCIe | 80         | BF16      | Throughput | 1         |
| H100 PCIe | 80         | FP8       | Throughput | 1         |
| A100 PCIe | 80         | BF16      | Latency    | 2         |
| A100 PCIe | 80         | BF16      | Throughput | 1         |
| L40S      | 48         | BF16      | Latency    | 4         |
| L40S      | 48         | BF16      | Throughput | 2         |
| A10G      | 24         | BF16      | Latency    | 8         |
| A10G      | 24         | BF16      | Throughput | 4         |

Non-optimized Configuration

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 60         | BF16      | 50         |

Important

If there is not enough memory for the KV cache at the full context length, try reducing the model’s context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32768) when launching NIM.
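For illustration, here is one way to apply this setting at launch using the Docker SDK for Python; the image tag, cache mount, and port are assumptions for the sketch, and passing -e NIM_MAX_MODEL_LEN=32768 to docker run achieves the same thing.

```python
# Sketch: launch the NIM container with a reduced context length so the
# KV cache fits. Image tag, paths, and port are illustrative assumptions.
import os
import docker

client = docker.from_env()
client.containers.run(
    "nvcr.io/nim/meta/llama-3.2-11b-vision-instruct:latest",  # hypothetical tag
    environment={
        "NIM_MAX_MODEL_LEN": "32768",              # cap the context length
        "NGC_API_KEY": os.environ["NGC_API_KEY"],  # assumes the key is exported
    },
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    volumes={os.path.expanduser("~/.cache/nim"): {"bind": "/opt/nim/.cache",
                                                  "mode": "rw"}},
    ports={"8000/tcp": 8000},
    detach=True,
)
```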

Llama-3.2-90B-Vision-Instruct

Overview

The Meta Llama 3.2 Vision collection of multimodal large language models (LLMs) comprises pre-trained and instruction-tuned image-reasoning generative models in 11B and 90B sizes (text + images in, text out). The Llama 3.2 Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. They outperform many of the available open source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.

The 90B model is recommended for users who want to prioritize model accuracy and have a high compute budget.

Important

This model accepts only a single image per request.

Important

This model does not support tool use.

Optimized Configurations

NVIDIA recommends at least 200 GB of free disk space for the container and model.

The GPU Memory values are in GB; the Profile column indicates what the model is optimized for.

| GPU       | GPU Memory | Precision | Profile    | # of GPUs |
|-----------|------------|-----------|------------|-----------|
| H200 SXM  | 141        | BF16      | Latency    | 4         |
| H200 SXM  | 141        | FP8       | Latency    | 2         |
| H200 SXM  | 141        | BF16      | Throughput | 2         |
| H200 SXM  | 141        | FP8       | Throughput | 1         |
| H100 SXM  | 80         | BF16      | Latency    | 8         |
| H100 SXM  | 80         | FP8       | Latency    | 4         |
| H100 SXM  | 80         | BF16      | Throughput | 4         |
| H100 SXM  | 80         | FP8       | Throughput | 2         |
| A100 SXM  | 80         | BF16      | Latency    | 8         |
| A100 SXM  | 80         | BF16      | Throughput | 4         |
| H100 PCIe | 80         | BF16      | Latency    | 8         |
| H100 PCIe | 80         | FP8       | Latency    | 4         |
| H100 PCIe | 80         | BF16      | Throughput | 4         |
| H100 PCIe | 80         | FP8       | Throughput | 2         |
| A100 PCIe | 80         | BF16      | Latency    | 8         |
| A100 PCIe | 80         | BF16      | Throughput | 4         |
| L40S      | 48         | BF16      | Throughput | 8         |

Non-optimized Configuration

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.

Important

Requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.

| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 240        | BF16      | 200        |

Important

If there is not enough memory for the KV cache at the full context length, try reducing the model’s context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32768) when launching NIM.