# Support Matrix

## Hardware

NVIDIA NIMs for vision language models (VLMs) should run on any NVIDIA GPU with sufficient memory, although this is not guaranteed. They can also run on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and a CUDA compute capability >= 7.0 (8.0 for bfloat16). See the Supported Models section below for further information.
## Software

- Linux operating systems (Ubuntu 20.04 or later recommended)
- NVIDIA Driver >= 535
- NVIDIA Docker >= 23.0.1
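As a quick sanity check, the driver requirement can be verified programmatically. A minimal sketch, assuming the nvidia-ml-py (pynvml) package is installed; running `nvidia-smi` reports the same information:

```python
# Sketch: confirm the NVIDIA driver meets the >= 535 requirement.
# Assumes nvidia-ml-py (pip install nvidia-ml-py); `nvidia-smi` shows the same.
import pynvml

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()
if isinstance(driver, bytes):  # older pynvml releases return bytes
    driver = driver.decode()
pynvml.nvmlShutdown()

major = int(driver.split(".")[0])
print(f"Driver {driver}: {'OK' if major >= 535 else 'too old, need >= 535'}")
```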
## Supported Models

The following models are optimized with TensorRT-LLM (TRT-LLM) and are available as pre-built, optimized engines on NGC. Use them through the Chat Completions endpoint.
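This endpoint follows the OpenAI API schema. A minimal request sketch, assuming a NIM running locally on the default port 8000; the model id shown is a placeholder, so query `/v1/models` for the one your deployment serves:

```python
# Sketch: calling a locally deployed NIM through its OpenAI-compatible
# Chat Completions endpoint. Base URL and model id are assumptions;
# substitute the values for your deployment.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta/llama-3.2-11b-vision-instruct",  # placeholder model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }],
        "max_tokens": 256,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```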
### nemoretriever-parse

nemoretriever-parse is a compact autoregressive vision language model (VLM) designed for document transcription from images. You supply an input image, and nemoretriever-parse outputs its text in reading order along with information about the document structure. nemoretriever-parse leverages C-RADIO (Commercial RADIO) for visual feature extraction and mBART as the decoder for generating text output.
**Important:** This model takes requests with a single image; images larger than 2048x1648 px are scaled down.

**Important:** This model does not support text input.
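Because nemoretriever-parse accepts exactly one image and no text, a request carries only an image part in the message content. A minimal sketch, assuming a local deployment on port 8000 and a placeholder model id; the exact request schema for your deployment may differ:

```python
# Sketch: sending a single document image to nemoretriever-parse.
# The message content holds only an image part, since the model takes
# no text input. Endpoint and model id are assumptions.
import base64
import requests

with open("page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "nvidia/nemoretriever-parse",  # placeholder model id
        "messages": [{
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            }],
        }],
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```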
#### Optimized Configurations

NVIDIA recommends at least 30 GB of disk space for the container and model.

GPU Memory values are in GB; the Profile column indicates what the model is optimized for.
| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H100 SXM | 80 | BF16 | Throughput | 1 |
| A100 SXM | 80 | BF16 | Throughput | 1 |
| L40S | 48 | BF16 | Throughput | 1 |
#### Local Build Optimized Configurations

For GPU configurations not listed above, NIM for VLMs offers support through a local build. Any NVIDIA GPU with sufficient memory should be able to build and run this model, although this is not guaranteed. A local build starts automatically when no matching pre-built configuration is found.
**Note:** Requires a GPU with compute capability >= 8.0.

The GPU Memory and Disk Space values are in GB.
| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 10 | BF16 | 30 |
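To check the compute capability >= 8.0 requirement noted above before attempting a local build, something like the following works. PyTorch is an assumption here; `nvidia-smi --query-gpu=compute_cap --format=csv` reports the same value:

```python
# Sketch: verify each visible GPU meets compute capability >= 8.0,
# which the local build requires. Assumes PyTorch is installed.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    ok = (major, minor) >= (8, 0)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} "
          f"(compute capability {major}.{minor}) -> "
          f"{'OK' if ok else 'unsupported'}")
```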
### Llama-3.2-11B-Vision-Instruct

#### Overview

Meta's Llama 3.2 Vision is a collection of pre-trained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image, and they outperform many of the available open-source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.
The 11B model is recommended for users who want to prioritize response speed and have a moderate compute budget.
**Important:** This model only takes requests with a single image; images larger than 1120x1120 px are scaled down.

**Important:** This model does not support tool use.
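Since images above 1120x1120 px are scaled down server-side anyway, resizing on the client first can save upload bandwidth. A minimal sketch using Pillow, which is an assumption here; any image library works:

```python
# Sketch: pre-scale an image on the client so it stays within the
# 1120x1120 px limit and the server does not have to resize it.
# Assumes Pillow (pip install Pillow).
from PIL import Image

img = Image.open("photo.jpg")
img.thumbnail((1120, 1120))  # shrinks in place, preserving aspect ratio
img.save("photo_resized.jpg")
print(img.size)
```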
#### Optimized Configurations

NVIDIA recommends at least 50 GB of disk space for the container and model.

GPU Memory values are in GB; the Profile column indicates what the model is optimized for.
| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H200 SXM | 141 | BF16 | Latency | 2 |
| H200 SXM | 141 | FP8 | Latency | 2 |
| H200 SXM | 141 | BF16 | Throughput | 1 |
| H200 SXM | 141 | FP8 | Throughput | 1 |
| H100 SXM | 80 | BF16 | Latency | 2 |
| H100 SXM | 80 | FP8 | Latency | 2 |
| H100 SXM | 80 | BF16 | Throughput | 1 |
| H100 SXM | 80 | FP8 | Throughput | 1 |
| A100 SXM | 80 | BF16 | Latency | 2 |
| A100 SXM | 80 | BF16 | Throughput | 1 |
| H100 PCIe | 80 | BF16 | Latency | 2 |
| H100 PCIe | 80 | FP8 | Latency | 2 |
| H100 PCIe | 80 | BF16 | Throughput | 1 |
| H100 PCIe | 80 | FP8 | Throughput | 1 |
| A100 PCIe | 80 | BF16 | Latency | 2 |
| A100 PCIe | 80 | BF16 | Throughput | 1 |
| L40S | 48 | BF16 | Latency | 4 |
| L40S | 48 | BF16 | Throughput | 2 |
| A10G | 24 | BF16 | Latency | 8 |
| A10G | 24 | BF16 | Throughput | 4 |
#### Non-optimized Configuration

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.
**Important:** Requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.
| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 60 | BF16 | 50 |
**Important:** If there is not enough memory left for the KV cache at the full context length, try reducing the model's context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32768) when launching NIM.
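To see whether a host clears the 60 GB aggregate-memory figure in the table above, the per-GPU memory can be summed. A minimal sketch, again assuming nvidia-ml-py:

```python
# Sketch: sum memory across all visible GPUs and compare against the
# 60 GB the non-optimized BF16 configuration needs. Assumes nvidia-ml-py.
import pynvml

pynvml.nvmlInit()
total_gb = sum(
    pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i)).total
    for i in range(pynvml.nvmlDeviceGetCount())
) / 1024**3
pynvml.nvmlShutdown()

print(f"Aggregate GPU memory: {total_gb:.0f} GB")
if total_gb < 60:
    print("Below 60 GB: consider lowering NIM_MAX_MODEL_LEN or adding GPUs.")
```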
### Llama-3.2-90B-Vision-Instruct

#### Overview

Meta's Llama 3.2 Vision is a collection of pre-trained and instruction-tuned multimodal large language models (LLMs) for image reasoning, available in 11B and 90B sizes (text + images in / text out). The instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image, and they outperform many of the available open-source and closed multimodal models on common industry benchmarks. Llama 3.2 Vision models are ready for commercial use.
The 90B model is recommended for users who want to prioritize model accuracy and have a high compute budget.
**Important:** This model only takes requests with a single image; images larger than 1120x1120 px are scaled down.

**Important:** This model does not support tool use.
#### Optimized Configurations

NVIDIA recommends at least 200 GB of disk space for the container and model.

GPU Memory values are in GB; the Profile column indicates what the model is optimized for.
| GPU | GPU Memory | Precision | Profile | # of GPUs |
|-----|------------|-----------|---------|-----------|
| H200 SXM | 141 | BF16 | Latency | 4 |
| H200 SXM | 141 | FP8 | Latency | 2 |
| H200 SXM | 141 | BF16 | Throughput | 2 |
| H200 SXM | 141 | FP8 | Throughput | 1 |
| H100 SXM | 80 | BF16 | Latency | 8 |
| H100 SXM | 80 | FP8 | Latency | 4 |
| H100 SXM | 80 | BF16 | Throughput | 4 |
| H100 SXM | 80 | FP8 | Throughput | 2 |
| A100 SXM | 80 | BF16 | Latency | 8 |
| A100 SXM | 80 | BF16 | Throughput | 4 |
| H100 PCIe | 80 | BF16 | Latency | 8 |
| H100 PCIe | 80 | FP8 | Latency | 4 |
| H100 PCIe | 80 | BF16 | Throughput | 4 |
| H100 PCIe | 80 | FP8 | Throughput | 2 |
| A100 PCIe | 80 | BF16 | Latency | 8 |
| A100 PCIe | 80 | BF16 | Throughput | 4 |
| L40S | 48 | BF16 | Throughput | 8 |
#### Non-optimized Configuration

For GPU configurations not listed above, NIM for VLMs offers competitive performance through a custom vLLM backend. Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed.
**Important:** Requires compute capability >= 7.0 (8.0 for bfloat16).

The GPU Memory and Disk Space values are in GB.
| GPU Memory | Precision | Disk Space |
|------------|-----------|------------|
| 240 | BF16 | 200 |
**Important:** If there is not enough memory left for the KV cache at the full context length, try reducing the model's context length by setting the environment variable NIM_MAX_MODEL_LEN to a smaller value (for example, 32768) when launching NIM.