Support Matrix
NVIDIA NIMs for large language models run on any NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and CUDA compute capability > 7.0 (8.0 for bfloat16). Some model/GPU combinations, including vGPU, are optimized. See the Supported Models sections below for details.
- Linux operating system (Ubuntu 20.04 or later recommended)
- NVIDIA Driver >= 535
- NVIDIA Docker >= 23.0.1
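A rough sanity check against the "sufficient memory" requirement above can be sketched as follows. The 1.2× overhead multiplier for KV cache and activations is an assumption for illustration, not an NVIDIA figure:

```python
def min_gpu_memory_gb(params_billion: float, bytes_per_param: float,
                      overhead: float = 1.2) -> float:
    """Rough lower bound on GPU memory needed to serve a model.

    bytes_per_param: 2 for FP16/BF16 weights, 1 for FP8/INT8 weights.
    overhead: crude allowance for KV cache and activations (assumption).
    """
    return params_billion * bytes_per_param * overhead

# Llama 3.1 8B in FP16: ~8e9 params * 2 bytes ~= 16 GB of weights alone,
# which is why the tables below pair it with 24 GB GPUs at minimum.
print(round(min_gpu_memory_gb(8, 2), 1))
# A 70B model in FP16 exceeds any single 80 GB GPU, hence multi-GPU profiles.
print(round(min_gpu_memory_gb(70, 2), 1))
```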
The following models are optimized using TRT-LLM, are available as pre-built, optimized engines on NGC, and should use the Chat Completions endpoint. In vGPU environments, the GPU memory values in the following sections refer to the total GPU memory, including the memory reserved for the vGPU setup.
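A minimal Chat Completions request against a locally deployed NIM can be sketched as follows. The base URL, port, and model name are assumptions for illustration and must match your actual deployment:

```python
import json
import urllib.request

# Assumed endpoint of a locally running NIM container (adjust to your setup).
URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible chat completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta/llama-3.1-8b-instruct", "What is CUDA?")
print(json.dumps(payload, indent=2))

# To actually send the request against a running container, uncomment:
# req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```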
Llama 3.1 8B Base
Optimized configurations
Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Latency | 2 | 10.81 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 2 | 16.12 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 1 | 9.76 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 15.04 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 15.05 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 2 | 16.27 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 15.1 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 15.11 |
| L40S PCIe | 48 | FP16 | Latency | 2 | 16.18 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 16.18 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 16.16 |
| A10G | 24 | FP16 | Latency | 4 | 18.41 |
| A10G | 24 | FP16 | Throughput | 2 | 16.21 |
| A10G | 24 | FP16 | Throughput | 2 | 18.39 |
Llama 3.1 8B Instruct
Optimized configurations
Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Latency | 2 | 10.81 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 2 | 16.11 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 1 | 9.76 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 15.04 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 15.05 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 2 | 16.28 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 15.1 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 15.11 |
| L40S PCIe | 48 | FP16 | Latency | 2 | 16.18 |
| L40S PCIe | 48 | FP16 | Throughput | 2 | 16.17 |
| L40S PCIe | 48 | FP16 | Throughput | 2 | 16.16 |
| A10G | 24 | FP16 | Latency | 4 | 18.41 |
| A10G | 24 | FP16 | Throughput | 2 | 16.21 |
| A10G | 24 | FP16 | Throughput | 2 | 18.41 |
Llama 3.1 70B Instruct
Optimized configurations
Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP16 | Latency | 8 | 146.98 |
| H100 SXM/NVLink | 80 | FP8 | Latency | 8 | 94.01 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 4 | 138.17 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 4 | 85.44 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 4 | 138.53 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 8 | 148.11 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 4 | 138.81 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 4 | 138.85 |
Meta-Llama-3-8B-Instruct
Optimized configurations
Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 28 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 2 | 28 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 28 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 2 | 28 |
| L40S PCIe | 48 | FP8 | Throughput | 1 | 20.5 |
| L40S PCIe | 48 | FP8 | Latency | 2 | 20.5 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 28 |
| A10G PCIe | 24 | FP16 | Throughput | 1 | 28 |
| A10G PCIe | 24 | FP16 | Latency | 2 | 28 |
Non-optimized configuration
Disk Space covers both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) | 24 | FP16 | 16 |
Meta-Llama-3-70B-Instruct
Optimized configurations
Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Throughput | 4 | 82 |
| H100 SXM/NVLink | 80 | FP8 | Latency | 8 | 82 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 4 | 158 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 8 | 158 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 4 | 158 |
Non-optimized configuration
Disk Space covers both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) | 240 | FP16 | 100 |
Mistral-7B-Instruct-v0.3
Optimized configurations
Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Latency | 2 | 7.16 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 2 | 13.82 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 1 | 7.06 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 13.54 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 2 | 13.82 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 13.54 |
| L40S PCIe | 48 | FP8 | Latency | 2 | 7.14 |
| L40S PCIe | 48 | FP16 | Latency | 2 | 13.82 |
| L40S PCIe | 48 | FP8 | Throughput | 1 | 7.06 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 13.54 |
Non-optimized configuration
Disk Space covers both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability 7.0 or higher (8.0 for bfloat16) | 24 | FP16 | 16 |
Mixtral-8x7B-v0.1
Optimized configurations
Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Latency | 4 | 7.16 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 2 | 7.06 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 4 | 13.82 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 2 | 13.54 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 4 | 13.82 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 2 | 13.54 |
| L40S PCIe | 48 | FP16 | Throughput | 4 | 13.82 |
Non-optimized configuration
Disk Space covers both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability 7.0 or higher (8.0 for bfloat16) | 24 | FP16 | 16 |
Mixtral-8x22B-v0.1
Optimized configurations
Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Throughput | 8 | 132.61 |
| H100 SXM/NVLink | 80 | Int4wo | Throughput | 8 | 134.82 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 8 | 265.59 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 8 | 265.7 |
Non-optimized configuration
Disk Space covers both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 240 | FP16 | 100 |
The following LoRA formats are supported:

| Foundation Model | HuggingFace Format | NeMo Format |
|---|---|---|
| Meta-Llama-3-8B-Instruct | Yes | Yes |
| Meta-Llama-3-70B-Instruct | Yes | Yes |
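With a LoRA adapter deployed, a common pattern is to address the adapter through the same Chat Completions payload by passing its served name in the `model` field. The adapter name below is hypothetical, chosen only to illustrate the payload shape:

```python
import json

def lora_chat_request(adapter_name: str, prompt: str) -> dict:
    # The adapter is selected via the "model" field; the rest is a
    # standard OpenAI-compatible chat completions payload.
    return {
        "model": adapter_name,
        "messages": [{"role": "user", "content": prompt}],
    }

# "llama3-8b-my-adapter" is a made-up adapter name for illustration.
print(json.dumps(lora_chat_request("llama3-8b-my-adapter", "Hello"), indent=2))
```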