Support Matrix#

Hardware#

NVIDIA NIMs for large-language models will run on any NVIDIA GPU, as long as the GPU has sufficient memory, or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and CUDA compute capability > 7.0 (8.0 for bfloat16). Some model/GPU combinations are optimized. See the following Supported Models section for further information.

Software#

Linux operating systems (Ubuntu 20.04 or later recommended)
NVIDIA Driver >= 535
NVIDIA Docker >= 23.0.1

Supported Models#

These models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC and should use the Chat Completions Endpoint.

Llama 2 7B Chat#

Optimized configurations#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	GPU Memory	Precision	Profile	# of GPUs	Disk Space
H100 SXM/NVLink	80	8	Latency	2	6.66
H100 SXM/NVLink	80	16	Latency	2	12.93
H100 SXM/NVLink	80	8	Throughput	1	6.57
H100 SXM/NVLink	80	16	Throughput	1	12.62
H100 SXM/NVLink	80	16	Throughput LoRA	1	12.63
A100 SXM/NVLink	80	16	Latency	2	12.92
A100 SXM/NVLink	80	16	Throughput	1	15.54
A100 SXM/NVLink	80	16	Throughput LoRA	1	12.63
L40S PCIe	48	8	Latency	2	6.64
L40S PCIe	48	16	Latency	2	12.95
L40S PCIe	48	8	Throughput	1	6.57
L40S PCIe	48	16	Throughput	1	12.64
L40S PCIe	48	16	Throughput LoRA	1	12.65

Non-optimized configuration#

Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)

Llama 2 13B Chat#

Optimized configurations#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	GPU Memory	Precision	Profile	# of GPUs	Disk Space
H100 SXM/NVLink	80	8	Latency	2	12.6
H100 SXM/NVLink	80	16	Latency	2	24.71
H100 SXM/NVLink	80	16	Throughput	1	24.33
H100 SXM/NVLink	80	16	Throughput LoRA	1	24.35
A100 SXM/NVLink	80	16	Latency	2	24.74
A100 SXM/NVLink	80	16	Throughput	2	24.34
A100 SXM/NVLink	80	16	Throughput LoRA	1	24.37
L40S PCIe	48	8	Latency	2	12.39
L40S PCIe	48	16	Latency	2	24.7
L40S PCIe	48	8	Throughput	1	12.49
L40S PCIe	48	16	Throughput	1	24.33
L40S PCIe	48	16	Throughput LoRA	1	24.37

Non-optimized configuration#

Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)

Llama 2 70B Chat#

Optimized configurations#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	GPU Memory	Precision	Profile	# of GPUs	Disk Space
H100 SXM/NVLink	80	16	Throughput	4	130.52
H100 SXM/NVLink	80	8	Latency	4	65.36
H100 SXM/NVLink	80	16	Latency	8	133.18
H100 SXM/NVLink	80	8	Throughput	2	65.08
H100 SXM/NVLink	80	16	Throughput LoRA	4	130.6
A100 SXM/NVLink	80	16	Latency	4	133.12
A100 SXM/NVLink	80	16	Throughput	4	130.52
A100 SXM/NVLink	80	16	Throughput LoRA	4	130.5
L40S PCIe	48	8	Throughput	4	63.35

Non-optimized configuration#

Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)

Meta-Llama-3-8B-Instruct#

Optimized configurations#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	GPU Memory	Precision	Profile	# of GPUs	Disk Space
H100	80	FP16	Throughput	1	28
H100	80	FP16	Latency	2	28
A100	80	FP16	Throughput	1	28
A100	80	FP16	Latency	2	28
L40S PCIe	48	FP8	Throughput	1	20.5
L40S PCIe	48	FP8	Latency	2	20.5
L40S PCIe	48	FP16	Throughput	1	28
A10G PCIe	24	FP16	Throughput	1	28
A10G PCIe	24	FP16	Latency	2	28

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)	24	FP16	16

Meta-Llama-3-70B-Instruct#

Optimized configurations#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	GPU Memory	Precision	Profile	# of GPUs	Disk Space
H100	80	FP8	Throughput	4	82
H100	80	FP8	Latency	8	82
H100	80	FP16	Throughput	4	158
H100	80	FP16	Latency	8	158
A100	80	FP16	Throughput	4	158
L40S PCIe	48	FP8	Throughput	4	82
L40S PCIe	48	FP8	Latency	8	82
L40S PCIe	48	FP16	Throughput	8	158

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)	240	FP16	100

Mistral-7B-Instruct-v0.3#

Optimized configurations#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	GPU Memory	Precision	Profile	# of GPUs	Disk Space
H100 SXM/NVLink	80	FP8	Latency	2	7.16
H100 SXM/NVLink	80	FP16	Latency	2	13.82
H100 SXM/NVLink	80	FP8	Throughput	1	7.06
H100 SXM/NVLink	80	FP16	Throughput	1	13.54
A100 SXM/NVLink	80	FP16	Latency	2	13.82
A100 SXM/NVLink	80	FP16	Throughput	1	13.54
L40S PCIe	48	FP8	Latency	2	7.14
L40S PCIe	48	FP16	Latency	2	13.82
L40S PCIe	48	FP8	Throughput	1	7.06
L40S PCIe	48	FP16	Throughput	1	13.54

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 or higher (8.0 for bfloat16)	24	FP16	16

Mixtral-8x7B-v0.1#

Optimized configurations#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	GPU Memory	Precision	Profile	# of GPUs	Disk Space
H100 SXM/NVLink	80	FP8	Latency	4	7.16
H100 SXM/NVLink	80	FP8	Throughput	2	7.06
H100 SXM/NVLink	80	FP16	Latency	4	13.82
H100 SXM/NVLink	80	FP16	Throughput	2	13.54
A100 SXM/NVLink	80	FP16	Throughput	4	13.82
A100 SXM/NVLink	80	FP16	Throughput	2	13.54
L40S PCIe	48	FP16	Throughput	4	13.82

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 or higher (8.0 for bfloat16)	24	FP16	16

Mixtral-8x22B-v0.1#

Optimized configurations#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	GPU Memory	Precision	Profile	# of GPUs	Disk Space
H100 SXM/NVLink	80	FP8	Throughput	8	132.61
H100 SXM/NVLink	80	Int4wo	Throughput	8	134.82
H100 SXM/NVLink	80	FP16	Throughput	8	265.59
A100 SXM/NVLink	80	FP16	Throughput	8	265.7

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory	240	FP16	100

Supported LoRA formats#

The following LoRA formats are supported:

Foundation Model	HuggingFace Format	NeMo Format
Meta-Llama3-8b-Instruct	Yes	Yes
Meta-Llama3-70b-Instruct	Yes	Yes