Support Matrix#
Hardware#
NVIDIA NIMs for large-language models will run on any NVIDIA GPU, as long as the GPU has sufficient memory, or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and CUDA compute capability > 7.0 (8.0 for bfloat16). Some model/GPU combinations are optimized. See the following Supported Models section for further information.
Software#
Linux operating systems (Ubuntu 20.04 or later recommended)
NVIDIA Driver >= 535
NVIDIA Docker >= 23.0.1
Supported Models#
These models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC and should use the Chat Completions Endpoint.
Llama 2 7B Chat#
Optimized configurations#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.
GPU |
GPU Memory |
Precision |
Profile |
# of GPUs |
Disk Space |
---|---|---|---|---|---|
H100 SXM/NVLink |
80 |
8 |
Latency |
2 |
6.66 |
H100 SXM/NVLink |
80 |
16 |
Latency |
2 |
12.93 |
H100 SXM/NVLink |
80 |
8 |
Throughput |
1 |
6.57 |
H100 SXM/NVLink |
80 |
16 |
Throughput |
1 |
12.62 |
H100 SXM/NVLink |
80 |
16 |
Throughput LoRA |
1 |
12.63 |
A100 SXM/NVLink |
80 |
16 |
Latency |
2 |
12.92 |
A100 SXM/NVLink |
80 |
16 |
Throughput |
1 |
15.54 |
A100 SXM/NVLink |
80 |
16 |
Throughput LoRA |
1 |
12.63 |
L40S PCIe |
48 |
8 |
Latency |
2 |
6.64 |
L40S PCIe |
48 |
16 |
Latency |
2 |
12.95 |
L40S PCIe |
48 |
8 |
Throughput |
1 |
6.57 |
L40S PCIe |
48 |
16 |
Throughput |
1 |
12.64 |
L40S PCIe |
48 |
16 |
Throughput LoRA |
1 |
12.65 |
Non-optimized configuration#
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)
Llama 2 13B Chat#
Optimized configurations#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.
GPU |
GPU Memory |
Precision |
Profile |
# of GPUs |
Disk Space |
---|---|---|---|---|---|
H100 SXM/NVLink |
80 |
8 |
Latency |
2 |
12.6 |
H100 SXM/NVLink |
80 |
16 |
Latency |
2 |
24.71 |
H100 SXM/NVLink |
80 |
16 |
Throughput |
1 |
24.33 |
H100 SXM/NVLink |
80 |
16 |
Throughput LoRA |
1 |
24.35 |
A100 SXM/NVLink |
80 |
16 |
Latency |
2 |
24.74 |
A100 SXM/NVLink |
80 |
16 |
Throughput |
2 |
24.34 |
A100 SXM/NVLink |
80 |
16 |
Throughput LoRA |
1 |
24.37 |
L40S PCIe |
48 |
8 |
Latency |
2 |
12.39 |
L40S PCIe |
48 |
16 |
Latency |
2 |
24.7 |
L40S PCIe |
48 |
8 |
Throughput |
1 |
12.49 |
L40S PCIe |
48 |
16 |
Throughput |
1 |
24.33 |
L40S PCIe |
48 |
16 |
Throughput LoRA |
1 |
24.37 |
Non-optimized configuration#
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)
Llama 2 70B Chat#
Optimized configurations#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.
GPU |
GPU Memory |
Precision |
Profile |
# of GPUs |
Disk Space |
---|---|---|---|---|---|
H100 SXM/NVLink |
80 |
16 |
Throughput |
4 |
130.52 |
H100 SXM/NVLink |
80 |
8 |
Latency |
4 |
65.36 |
H100 SXM/NVLink |
80 |
16 |
Latency |
8 |
133.18 |
H100 SXM/NVLink |
80 |
8 |
Throughput |
2 |
65.08 |
H100 SXM/NVLink |
80 |
16 |
Throughput LoRA |
4 |
130.6 |
A100 SXM/NVLink |
80 |
16 |
Latency |
4 |
133.12 |
A100 SXM/NVLink |
80 |
16 |
Throughput |
4 |
130.52 |
A100 SXM/NVLink |
80 |
16 |
Throughput LoRA |
4 |
130.5 |
L40S PCIe |
48 |
8 |
Throughput |
4 |
63.35 |
Non-optimized configuration#
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)
Meta-Llama-3-8B-Instruct#
Optimized configurations#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.
GPU |
GPU Memory |
Precision |
Profile |
# of GPUs |
Disk Space |
---|---|---|---|---|---|
H100 |
80 |
FP16 |
Throughput |
1 |
28 |
H100 |
80 |
FP16 |
Latency |
2 |
28 |
A100 |
80 |
FP16 |
Throughput |
1 |
28 |
A100 |
80 |
FP16 |
Latency |
2 |
28 |
L40S PCIe |
48 |
FP8 |
Throughput |
1 |
20.5 |
L40S PCIe |
48 |
FP8 |
Latency |
2 |
20.5 |
L40S PCIe |
48 |
FP16 |
Throughput |
1 |
28 |
A10G PCIe |
24 |
FP16 |
Throughput |
1 |
28 |
A10G PCIe |
24 |
FP16 |
Latency |
2 |
28 |
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
GPUs |
GPU Memory |
Precision |
Disk Space |
---|---|---|---|
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) |
24 |
FP16 |
16 |
Meta-Llama-3-70B-Instruct#
Optimized configurations#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.
GPU |
GPU Memory |
Precision |
Profile |
# of GPUs |
Disk Space |
---|---|---|---|---|---|
H100 |
80 |
FP8 |
Throughput |
4 |
82 |
H100 |
80 |
FP8 |
Latency |
8 |
82 |
H100 |
80 |
FP16 |
Throughput |
4 |
158 |
H100 |
80 |
FP16 |
Latency |
8 |
158 |
A100 |
80 |
FP16 |
Throughput |
4 |
158 |
L40S PCIe |
48 |
FP8 |
Throughput |
4 |
82 |
L40S PCIe |
48 |
FP8 |
Latency |
8 |
82 |
L40S PCIe |
48 |
FP16 |
Throughput |
8 |
158 |
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
GPUs |
GPU Memory |
Precision |
Disk Space |
---|---|---|---|
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) |
240 |
FP16 |
100 |
Mistral-7B-Instruct-v0.3#
Optimized configurations#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.
GPU |
GPU Memory |
Precision |
Profile |
# of GPUs |
Disk Space |
---|---|---|---|---|---|
H100 SXM/NVLink |
80 |
FP8 |
Latency |
2 |
7.16 |
H100 SXM/NVLink |
80 |
FP16 |
Latency |
2 |
13.82 |
H100 SXM/NVLink |
80 |
FP8 |
Throughput |
1 |
7.06 |
H100 SXM/NVLink |
80 |
FP16 |
Throughput |
1 |
13.54 |
A100 SXM/NVLink |
80 |
FP16 |
Latency |
2 |
13.82 |
A100 SXM/NVLink |
80 |
FP16 |
Throughput |
1 |
13.54 |
L40S PCIe |
48 |
FP8 |
Latency |
2 |
7.14 |
L40S PCIe |
48 |
FP16 |
Latency |
2 |
13.82 |
L40S PCIe |
48 |
FP8 |
Throughput |
1 |
7.06 |
L40S PCIe |
48 |
FP16 |
Throughput |
1 |
13.54 |
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
GPUs |
GPU Memory |
Precision |
Disk Space |
---|---|---|---|
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 or higher (8.0 for bfloat16) |
24 |
FP16 |
16 |
Mixtral-8x7B-v0.1#
Optimized configurations#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.
GPU |
GPU Memory |
Precision |
Profile |
# of GPUs |
Disk Space |
---|---|---|---|---|---|
H100 SXM/NVLink |
80 |
FP8 |
Latency |
4 |
7.16 |
H100 SXM/NVLink |
80 |
FP8 |
Throughput |
2 |
7.06 |
H100 SXM/NVLink |
80 |
FP16 |
Latency |
4 |
13.82 |
H100 SXM/NVLink |
80 |
FP16 |
Throughput |
2 |
13.54 |
A100 SXM/NVLink |
80 |
FP16 |
Throughput |
4 |
13.82 |
A100 SXM/NVLink |
80 |
FP16 |
Throughput |
2 |
13.54 |
L40S PCIe |
48 |
FP16 |
Throughput |
4 |
13.82 |
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
GPUs |
GPU Memory |
Precision |
Disk Space |
---|---|---|---|
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 or higher (8.0 for bfloat16) |
24 |
FP16 |
16 |
Mixtral-8x22B-v0.1#
Optimized configurations#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.
GPU |
GPU Memory |
Precision |
Profile |
# of GPUs |
Disk Space |
---|---|---|---|---|---|
H100 SXM/NVLink |
80 |
FP8 |
Throughput |
8 |
132.61 |
H100 SXM/NVLink |
80 |
Int4wo |
Throughput |
8 |
134.82 |
H100 SXM/NVLink |
80 |
FP16 |
Throughput |
8 |
265.59 |
A100 SXM/NVLink |
80 |
FP16 |
Throughput |
8 |
265.7 |
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
GPUs |
GPU Memory |
Precision |
Disk Space |
---|---|---|---|
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory |
240 |
FP16 |
100 |
Supported LoRA formats#
The following LoRA formats are supported:
Foundation Model |
HuggingFace Format |
NeMo Format |
---|---|---|
Meta-Llama3-8b-Instruct |
Yes |
Yes |
Meta-Llama3-70b-Instruct |
Yes |
Yes |