Support Matrix
NVIDIA NIM for large language models runs on any NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and CUDA compute capability > 7.0 (8.0 for bfloat16). Some model/GPU combinations are specifically optimized. See the Supported Models sections below for details.
- Linux operating system (Ubuntu 20.04 or later recommended)
- NVIDIA Driver >= 535
- NVIDIA Docker >= 23.0.1
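As a rough rule of thumb, the GPU memory needed just to hold a model's weights is its parameter count times the bytes per parameter at the chosen precision. The sketch below illustrates this arithmetic; it deliberately ignores the KV cache, activations, and runtime overhead, so real requirements are higher. Compare the results against the tables that follow.

```python
def weights_memory_gb(n_params: float, bits: int) -> float:
    """Rough lower bound on GPU memory (GB) needed for model weights alone.

    Ignores KV cache, activations, and framework overhead, so the real
    requirement is higher; treat this as a sanity check, not a guarantee.
    """
    bytes_per_param = bits / 8
    return n_params * bytes_per_param / 1e9

# A 7B model in FP16 needs roughly 14 GB for weights alone, which is why
# it fits on a single 24 GB A10G but leaves limited KV-cache headroom.
print(weights_memory_gb(7e9, 16))  # → 14.0
print(weights_memory_gb(70e9, 8))  # → 70.0
```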
The following models are optimized with TensorRT-LLM (TRT-LLM), are available as pre-built, optimized engines on NGC, and should be used with the Chat Completions endpoint.
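NIM serves these engines through an OpenAI-compatible API, so a deployment can be queried with a standard chat-completions request. The sketch below only builds such a request body; the endpoint path, host, port, and model name shown in the comment are illustrative assumptions that depend on your deployment.

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> str:
    """Build an OpenAI-style chat-completions request body as a JSON string."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

# Example: POST this body to http://<host>:<port>/v1/chat/completions
# (hypothetical local deployment; adjust model name and address to your setup).
payload = build_chat_request("meta/llama2-7b-chat", "Hello!")
```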
Llama 2 7B Chat
Optimized configurations
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Latency | 2 | 6.66 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 2 | 12.93 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 1 | 6.57 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 12.62 |
| H100 SXM/NVLink | 80 | FP16 | Throughput LoRA | 1 | 12.63 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 2 | 12.92 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 15.54 |
| A100 SXM/NVLink | 80 | FP16 | Throughput LoRA | 1 | 12.63 |
| L40S PCIe | 48 | FP8 | Latency | 2 | 6.64 |
| L40S PCIe | 48 | FP16 | Latency | 2 | 12.95 |
| L40S PCIe | 48 | FP8 | Throughput | 1 | 6.57 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 12.64 |
| L40S PCIe | 48 | FP16 | Throughput LoRA | 1 | 12.65 |
Non-optimized configuration
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)
Llama 2 13B Chat
Optimized configurations
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Latency | 2 | 12.6 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 2 | 24.71 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 24.33 |
| H100 SXM/NVLink | 80 | FP16 | Throughput LoRA | 1 | 24.35 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 2 | 24.74 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 2 | 24.34 |
| A100 SXM/NVLink | 80 | FP16 | Throughput LoRA | 1 | 24.37 |
| L40S PCIe | 48 | FP8 | Latency | 2 | 12.39 |
| L40S PCIe | 48 | FP16 | Latency | 2 | 24.7 |
| L40S PCIe | 48 | FP8 | Throughput | 1 | 12.49 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 24.33 |
| L40S PCIe | 48 | FP16 | Throughput LoRA | 1 | 24.37 |
Non-optimized configuration
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)
Llama 2 70B Chat
Optimized configurations
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP16 | Throughput | 4 | 130.52 |
| H100 SXM/NVLink | 80 | FP8 | Latency | 4 | 65.36 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 8 | 133.18 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 2 | 65.08 |
| H100 SXM/NVLink | 80 | FP16 | Throughput LoRA | 4 | 130.6 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 4 | 133.12 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 4 | 130.52 |
| A100 SXM/NVLink | 80 | FP16 | Throughput LoRA | 4 | 130.5 |
| L40S PCIe | 48 | FP8 | Throughput | 4 | 63.35 |
Non-optimized configuration
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)
Meta-Llama-3-8B-Instruct
Optimized configurations
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|---|
| H100 | 80 | FP16 | Throughput | 1 | 28 |
| H100 | 80 | FP16 | Latency | 2 | 28 |
| A100 | 80 | FP16 | Throughput | 1 | 28 |
| A100 | 80 | FP16 | Latency | 2 | 28 |
| L40S PCIe | 48 | FP8 | Throughput | 1 | 20.5 |
| L40S PCIe | 48 | FP8 | Latency | 2 | 20.5 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 28 |
| A10G PCIe | 24 | FP16 | Throughput | 1 | 28 |
| A10G PCIe | 24 | FP16 | Latency | 2 | 28 |
Non-optimized configuration
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) | 24 | FP16 | 16 |
Meta-Llama-3-70B-Instruct
Optimized configurations
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|---|
| H100 | 80 | FP8 | Throughput | 4 | 82 |
| H100 | 80 | FP8 | Latency | 8 | 82 |
| H100 | 80 | FP16 | Throughput | 4 | 158 |
| H100 | 80 | FP16 | Latency | 8 | 158 |
| A100 | 80 | FP16 | Throughput | 4 | 158 |
| L40S PCIe | 48 | FP8 | Throughput | 4 | 82 |
| L40S PCIe | 48 | FP8 | Latency | 8 | 82 |
| L40S PCIe | 48 | FP16 | Throughput | 8 | 158 |
Non-optimized configuration
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) | 240 | FP16 | 100 |
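For the non-optimized path, the aggregate-memory requirement translates directly into a minimum GPU count: divide the required memory by the per-GPU memory and round up. A small sketch, using the 240 GB aggregate figure stated above for Meta-Llama-3-70B-Instruct in FP16 (the GPUs must still be homogeneous):

```python
import math

def min_gpus(required_gb: float, per_gpu_gb: float) -> int:
    """Smallest number of homogeneous GPUs whose aggregate
    memory meets the required total."""
    return math.ceil(required_gb / per_gpu_gb)

print(min_gpus(240, 80))  # 80 GB GPUs → 3
print(min_gpus(240, 48))  # 48 GB GPUs → 5
```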
Mistral-7B-Instruct-v0.3
Optimized configurations
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Latency | 2 | 7.16 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 2 | 13.82 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 1 | 7.06 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 13.54 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 2 | 13.82 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 13.54 |
| L40S PCIe | 48 | FP8 | Latency | 2 | 7.14 |
| L40S PCIe | 48 | FP16 | Latency | 2 | 13.82 |
| L40S PCIe | 48 | FP8 | Throughput | 1 | 7.06 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 13.54 |
Non-optimized configuration
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) | 24 | FP16 | 16 |
Mixtral-8x7B-v0.1
Optimized configurations
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Latency | 4 | 7.16 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 2 | 7.06 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 4 | 13.82 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 2 | 13.54 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 4 | 13.82 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 2 | 13.54 |
| L40S PCIe | 48 | FP16 | Throughput | 4 | 13.82 |
Non-optimized configuration
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) | 24 | FP16 | 16 |
Mixtral-8x22B-v0.1
Optimized configurations
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model; Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|---|
| H100 SXM/NVLink | 80 | FP8 | Throughput | 8 | 132.61 |
| H100 SXM/NVLink | 80 | Int4wo | Throughput | 8 | 134.82 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 8 | 265.59 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 8 | 265.7 |
Non-optimized configuration
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 240 | FP16 | 100 |
The following LoRA formats are supported:

| Foundation Model | HuggingFace Format | NeMo Format |
|---|---|---|
| Meta-Llama-3-8B-Instruct | Yes | Yes |
| Meta-Llama-3-70B-Instruct | Yes | Yes |