Support Matrix#

Hardware#

NVIDIA NIMs for large-language models will run on any NVIDIA GPU, as long as the GPU has sufficient memory, or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and CUDA compute capability > 7.0 (8.0 for bfloat16). Some model/GPU combinations, including vGPU, are optimized. See the following Supported Models section for further information.

Software#

Linux operating systems (Ubuntu 20.04 or later recommended)
NVIDIA Driver >= 535
NVIDIA Docker >= 23.0.1

GPUs#

The GPU listed in the following sections have the following specifications.

GPU	Family	Memory
H100	SXM/NVLink	80GB
A100	SXM/NVLink	80GB
L40S	PCIe	48GB
A10G	PCIe	24GB

General Guidelines#

In general, NVIDIA recommends the following guidelines for models that NVIDIA NIMs support, but have not been either optimized for our TRT-LLM runtime nor tested against all of our GPUs in our lab. The values in these two tables are based on the number of parameters used during training.

Note

These values are estimates not guarantees.

GPUs#

Both H100 and A100 should be 80GB SXM/NVLink models, L40S should be 48GB PCIe models, and A10G should be 24GB PCIe models.

Billion Parameters	H100	A100	L40S	A10G
8 or fewer	1	1	1	1
8 to 70	1	1	2	4
70 to 300	4	4	8	16
300+	8	8	16	32

Disk Space#

In general you can expect the vLLM runtime and a model to take up about 4X the billions of parameters in GB. Therefore, given a 400B model and vLLM runtime, the combination should occupy about 1.6TB of disk space.

Supported Models#

The following models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC and should use the Chat Completions Endpoint. For vGPU environment, the GPU memory values in the following sections refers to the total GPU memory, including the reserved GPU memory for vGPU setup.

Llama 3 Swallow 70B Instruct V0.1#

Optimized configurations#

The Profile is for what the model is optimized; **LoRA is whether the model supports LoRA.

GPU	Precision	Profile	# of GPUs	LoRA
A100	fp16	Latency	8
A100	fp16	Throughput	4
A100	fp16	Throughput	4	Y
H100	fp8	Latency	8
H100	fp16	Latency	8
H100	fp8	Throughput	4
H100	fp16	Throughput	4
H100	fp16	Throughput	4	Y
L40S	fp8	Latency	8
L40S	fp16	Throughput	8
L40S	fp8	Throughput	4
L40S	fp16	Throughput	8	Y
A10G	fp16	Throughput	8

Llama 3 Taiwan 70B Instruct#

Optimized configurations#

The Profile is for what the model is optimized; **LoRA is whether the model supports LoRA.

GPU	Precision	Profile	# of GPUs	LoRA
A100	fp16	Latency	8
A100	fp16	Throughput	4
A100	fp16	Throughput	4	Y
H100	fp8	Latency	8
H100	fp16	Latency	8
H100	fp8	Throughput	4
H100	fp16	Throughput	4
H100	fp16	Throughput	4	Y
L40S	fp8	Latency	8
L40S	fp16	Throughput	8
L40S	fp8	Throughput	4
L40S	fp16	Throughput	8	Y
A10G	fp16	Throughput	8

Llama 3.1 8B Base#

Optimized configurations#

NVIDIA recommends at least 50GB disk space for the container and model.

The Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs
H100	BF16	Latency	2
H100	FP8	Latency	2
H100	BF16	Throughput	1
H100	FP8	Throughput	1
H100	BF16	Throughput	1
A100	BF16	Latency	2
A100	BF16	Throughput	1
A100	BF16	Throughput	1
L40S	BF16	Latency	2
L40S	BF16	Throughput	2
L40S	BF16	Throughput	2
A10G	BF16	Latency	4
A10G	BF16	Throughput	2
A10G	BF16	Throughput	4

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory	24	FP16	15

Llama 3.1 8B Instruct#

Optimized configurations#

NVIDIA recommends at least 50GB disk space for the container and model.

The Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs
H100	BF16	Latency	2
H100	FP8	Latency	2
H100	BF16	Throughput	1
H100	FP8	Throughput	1
H100	BF16	Throughput	1
A100	BF16	Latency	2
A100	BF16	Throughput	1
A100	BF16	Throughput	1
L40S	BF16	Latency	2
L40S	BF16	Throughput	2
L40S	BF16	Throughput	1
L40S	BF16	Throughput	2
A10G	BF16	Latency	4
A10G	BF16	Throughput	2
A10G	BF16	Throughput	4

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory	24	FP16	15

Llama 3.1 70B Instruct#

NVIDIA recommends at least 350GB disk space for the container and model.

Optimized configurations#

The Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs
H100	BF16	Latency	8
H100	FP8	Latency	8
H100	BF16	Throughput	4
H100	FP8	Throughput	4
H100	BF16	Throughput	4
A100	BF16	Latency	8
A100	BF16	Throughput	4
A100	BF16	Throughput	4
L40S	BF16	Throughput	8
L40S	BF16	Throughput	8

Llama 3.1 405B Instruct#

NVIDIA recommends at least 1.5TB disk space for the container and model.

Note

Only optimized profiles are available for Llama 3.1 405B Instruct.

GPU	Precision	Profile	# of GPUs
H100	FP16	Latency	16
H100	FP8	Latency	16
H100	FP8	Throughput	8
A100	FP16	Latency	16

Meta-Llama-3-8B-Instruct#

Optimized configurations#

The Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP16	Throughput	1	28
H100	FP16	Latency	2	28
A100	FP16	Throughput	1	28
A100	FP16	Latency	2	28
L40S	FP8	Throughput	1	20.5
L40S	FP8	Latency	2	20.5
L40S	FP16	Throughput	1	28
A10G	FP16	Throughput	1	28
A10G	FP16	Latency	2	28

Non-optimized configuration#

Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Meta-Llama-3-70B-Instruct#

Optimized configurations#

The Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	4	82
H100	FP8	Latency	8	82
H100	FP16	Throughput	4	158
H100	FP16	Latency	8	158
A100	FP16	Throughput	4	158

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory	240	FP16	100

Mistral-7B-Instruct-v0.3#

Optimized configurations#

The Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Latency	2	7.16
H100	FP16	Latency	2	13.82
H100	FP8	Throughput	1	7.06
H100	FP16	Throughput	1	13.54
A100	FP16	Latency	2	13.82
A100	FP16	Throughput	1	13.54
L40S	FP8	Latency	2	7.14
L40S	FP16	Latency	2	13.82
L40S	FP8	Throughput	1	7.06
L40S	FP16	Throughput	1	13.54

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory	24	FP16	16

Mixtral-8x7B-v0.1#

Optimized configurations#

The Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Latency	4	7.16
H100	FP8	Throughput	2	7.06
H100	FP16	Latency	4	13.82
H100	FP16	Throughput	2	13.54
A100	FP16	Throughput	4	13.82
A100	FP16	Throughput	2	13.54
L40S	FP16	Throughput	4	13.82

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory	24	FP16	16

Mistral-NeMo-12B-Instruct#

Optimized configurations#

The Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

The GPU Memory values are in GB; the Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP16	Throughput	1	23.35
H100	FP16	Latency	2	25.14
H100	FP8	Latency	2	13.82
A100	FP16	Throughput	1	23.35
A100	FP16	Latency	2	25.14
L40S	FP16	Throughput	2	25.14
L40S	FP16	Latency	4	28.71
L40S	FP8	Throughput	2	13.83
L40S	FP8	Latency	4	15.01
A10G	FP16	Throughput	4	28.71
A10G	FP16	Latency	4	35.87

Non-optimized configuration#

Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Mixtral-8x22B-v0.1#

Optimized configurations#

The Disk Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	8	132.61
H100	Int4wo	Throughput	8	134.82
H100	FP16	Throughput	8	265.59
A100	FP16	Throughput	8	265.7

Nemotron 4 340B Instruct#

Optimized configurations#

The Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs
H100	FP16	Latency	16
A100	FP16	Latency	16

Non-optimized configuration#

Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.