Support Matrix#

Hardware#

NVIDIA NIMs for large-language models should, but are not guaranteed to, run on any NVIDIA GPU, as long as the GPU has sufficient memory, or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and CUDA compute capability > 7.0 (8.0 for bfloat16). Some model/GPU combinations, including vGPU, are optimized. See the following Supported Models section for further information.

Software#

Linux operating systems (Ubuntu 20.04 or later recommended)
NVIDIA Driver >= 535
NVIDIA Docker >= 23.0.1

GPUs#

The GPU listed in the following sections have the following specifications.

GPU	Family	Memory
H100	SXM/NVLink	80GB
A100	SXM/NVLink	80GB
L40S	PCIe	48GB
A10G	PCIe	24GB

Supported Models#

The following models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC and should use the Chat Completions Endpoint. For vGPU environment, the GPU memory values in the following sections refers to the total GPU memory, including the reserved GPU memory for vGPU setup.

Code Llama 13B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP16	Throughput	2	24.63
H100	FP16	Latency	4	25.32
A100	FP16	Throughput	2	24.63
A100	FP16	Latency	4	25.31
L40S	FP16	Throughput	2	25.32
L40S	FP16	Latency	2	24.63
A10G	FP16	Throughput	4	25.32
A10G	FP16	Latency	8	26.69

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x									x			x

Code Llama 34B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	1	6.57
H100	FP8	Latency	2	6.66
H100	FP16	Throughput	1	12.62
H100	FP16	Throughput LoRA	1	12.63
H100	FP16	Latency	2	12.93
A100	FP16	Throughput	1	15.54
A100	FP16	Throughput LoRA	1	12.63
A100	FP16	Latency	2	12.92
L40S	FP8	Throughput	1	6.57
L40S	FP8	Latency	2	6.64
L40S	FP16	Throughput	1	12.64
L40S	FP16	Throughput LoRA	1	12.65
L40S	FP16	Latency	2	12.95

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x									x			x

Code Llama 70B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	fp8	throughput	4	65.47
H100	fp8	Latency	8	66.37
H100	fp16	throughput	4	130.35
H100	fp16	latency	8	66.37
A100	fp16	throughput	4	130.35
A100	fp16	latency	8	132.71
A10G	fp16	throughput	8	132.69

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x									x			x

(Meta) Llama 2 7B Chat#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	1	6.57
H100	FP8	Latency	2	6.66
H100	FP16	Throughput	1	12.62
H100	FP16	Throughput LoRA	1	12.63
H100	FP16	Latency	2	12.93
A100	FP16	Throughput	1	15.54
A100	FP16	Throughput LoRA	1	12.63
A100	FP16	Latency	2	12.92
L40S	FP8	Throughput	1	6.57
L40S	FP8	Latency	2	6.64
L40S	FP16	Throughput	1	12.64
L40S	FP16	Throughput LoRA	1	12.65
L40S	FP16	Latency	2	12.95

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x	x			x

(Meta) Llama 2 13B Chat#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Latency	2	12.6
H100	FP16	Throughput	1	24.33
H100	FP16	Throughput LoRA	1	24.35
H100	FP16	Latency	2	24.71
A100	FP16	Throughput	1	24.34
A100	FP16	Throughput LoRA	1	24.37
A100	FP16	Latency	2	24.74
L40S	FP8	Throughput	1	12.49
L40S	FP8	Latency	2	12.59
L40S	FP16	Throughput	1	24.33
L40S	FP16	Latency	2	24.7
L40S	FP16	Throughput LoRA	1	24.37

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x	x			x

(Meta) Llama 2 70B Chat#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	2	65.08
H100	FP8	Latency	4	65.36
H100	FP16	Throughput	4	130.52
H100	FP16	Throughput LoRA	4	130.6
H100	FP16	Latency	8	133.18
A100	FP16	Throughput	4	130.52
A100	FP16	Throughput LoRA	4	130.5
A100	FP16	Latency	8	133.12
L40S	FP8	Throughput	4	63.35

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x	x			x

Llama 3 SQLCoder 8B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk space
H100	fp8	throughput	1	15
H100	fp8	latency	2	16.02
H100	fp16	throughput	1	8.52
H100	fp16	latency	2	8.61
A100	fp16	throughput	1	15
A100	fp16	latency	2	16.02
L40S	fp8	throughput	1	15
L40S	fp8	latency	2	16.02
L40S	fp16	throughput	1	8.53
L40S	fp16	latency	2	8.61
A10G	fp16	throughput	1	15
A10G	fp16	throughput	2	16.02
A10G	fp16	latency	4	18.06

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
												x

Llama 3 Swallow 70B Instruct V0.1#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	2	68.42
H100	FP8	Latency	4	69.3
H100	FP16	Throughput	2	137.7
H100	FP16	Latency	4	145.94
A100	FP16	Throughput	2	137.7
A100	FP16	Latency	2	137.7
L40S	FP8	Throughput	2	68.48
A10G	FP16	Throughput	4	145.93

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x					x			x

Llama 3 Taiwan 70B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	2	68.42
H100	FP8	Latency	4	145.94
H100	FP16	Throughput	2	137.7
H100	FP16	Latency	4	137.7
A100	FP16	Throughput	2	137.7
A100	FP16	Latency	2	145.94
L40S	FP8	Throughput	2	68.48
A10G	FP16	Throughput	4	145.93

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x					x			x

Llama 3.1 8B Base#

Optimized configurations#

Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs
H100	BF16	Latency	2
H100	FP8	Latency	2
H100	BF16	Throughput	1
H100	FP8	Throughput	1
A100	BF16	Latency	2
A100	BF16	Throughput	1
L40S	BF16	Latency	2
L40S	BF16	Throughput	2
A10G	BF16	Latency	4
A10G	BF16	Throughput	2

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.	24	FP16	15

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x					x	x	x	x

Llama 3.1 8B Instruct#

Optimized configurations#

Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs
H100	BF16	Latency	2
H100	FP8	Latency	2
H100	BF16	Throughput	1
H100	FP8	Throughput	1
A100	BF16	Latency	2
A100	BF16	Throughput	1
L40S	BF16	Latency	2
L40S	BF16	Throughput	1
A10G	BF16	Latency	4
A10G	BF16	Throughput	2

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.	24	FP16	15

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x					x	x	x	x	x			x

Llama 3.1 70B Instruct#

Optimized configurations#

Profile is for what the model is optimized.

GPU	Precision	Profile	# of GPUs
H100	FP8	Throughput	4
H100	FP8	Latency	8
H100	BF16	Throughput	4
H100	BF16	Latency	8
A100	BF16	Throughput	4
A100	BF16	Latency	8
L40S	BF16	Throughput	8

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x					x	x	x	x	x		x

Llama 3.1 405B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	8	487
H100	FP16	Latency	8	797
A100	FP16	Latency	16	697

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.	240	FP16	100

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x					x	x		x	x	x

Llama 3.1 Nemotron 70B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk space
H100	fp8	throughput	2	68.18
H100	fp8	throughput	4	68.64
H100	fp8	latency	8	69.77
H100	fp16	throughput	4	137.94
H100	fp16	latency	8	146.41
A100	fp16	throughput	4	137.93
A100	fp16	latency	8	146.41

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x					x	x	x	x	x		x	x

Meta Llama 3 8B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP16	Throughput	1	28
H100	FP16	Latency	2	28
A100	FP16	Throughput	1	28
A100	FP16	Latency	2	28
L40S	FP8	Throughput	1	20.5
L40S	FP8	Latency	2	20.5
L40S	FP16	Throughput	1	28
A10G	FP16	Throughput	1	28
A10G	FP16	Latency	2	28

Non-optimized configuration#

The Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.	24	FP16	16

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x	x	x		x

Meta Llama 3 70B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	4	82
H100	FP8	Latency	8	82
H100	FP16	Throughput	4	158
H100	FP16	Latency	8	158
A100	FP16	Throughput	4	158

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.	240	FP16	100

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x	x	x	x	x

Mistral 7B Instruct V0.3#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	1	7.06
H100	FP8	Latency	2	7.16
H100	FP16	Throughput	1	13.54
H100	FP16	Latency	2	13.82
A100	FP16	Throughput	1	13.54
A100	FP16	Latency	2	13.82
L40S	FP8	Throughput	1	7.06
L40S	FP8	Latency	2	7.14
L40S	FP16	Throughput	1	13.54
L40S	FP16	Latency	2	13.82

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.	24	FP16	16

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x					x			x

Mistral NeMo Minitron 8B 8K Instruct#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	fp8	throughput	1	8.91
H100	fp8	latency	2	9.03
H100	fp16	throughput	1	15.72 GB
H100	fp16	latency	2	16.78 GB
A100	fp16	throughput	1	15.72
A100	fp16	latency	2	16.78
L40S	fp8	throughput	1	8.92
L40S	fp8	latency	2	9.02
L40S	fp16	throughput	1	15.72
L40S	fp16	latency	2	16.77
A10G	fp16	throughput	2	16.81
A10G	fp16	latency	4	15.72

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
												x

Mistral NeMo 12B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Latency	2	13.82
H100	FP16	Throughput	1	23.35
H100	FP16	Latency	2	25.14
A100	FP16	Throughput	1	23.35
A100	FP16	Latency	2	25.14
L40S	FP8	Throughput	2	13.83
L40S	FP8	Latency	4	15.01
L40S	FP16	Throughput	2	25.14
L40S	FP16	Latency	4	28.71
A10G	FP16	Throughput	4	28.71
A10G	FP16	Latency	8	35.87

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x									x			x

Mixtral 8x7B Instruct V0.1#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	2	43.91
H100	FP8	Latency	4	44.07
H100	FP16	Throughput	2	87.35
H100	FP16	Latency	4	87.95
A100	FP16	Throughput	2	87.35
L40S	FP16	Throughput	4	87.95

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.	24	FP16	16

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x		x							x		x

Mixtral 8x22B Instruct V0.1#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP8	Throughput	8	132.61
H100	FP8	Latency	8	132.56
H100	int8wo	Throughput	8	134.82
H100	int8wo	Latency	8	132.31
H100	FP16	Throughput	8	265.59
A100	FP16	Throughput	8	265.7

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x		x							x			x

Nemotron 4 340B Instruct#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP16	Latency	16	627
A100	FP16	Latency	16	627

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x					x			x

Nemotron 4 340B Instruct 128K#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP16	Latency	16	637
A100	FP16	Latency	16	637

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
												x

Nemotron 4 340B Reward#

Optimized configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	FP16	Latency	16	637
A100	FP16	Latency	16	637

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
x									x	x

Phi 3 Mini 4K Instruct#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	fp8	Throughput	1	3.8
H100	fp16	Throughput	1	7.14
A100	fp16	Throughput	1	7.14
L40S	fp8	Throughput	1	3.8
L40S	fp16	Throughput	1	7.14
A10G	fp16	Throughput	1	7.14

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
												x

Phind Codellama 34B V2#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

GPU	Precision	Profile	# of GPUs	Disk Space
H100	fp8	throughput	2	32.17
H100	fp8	latency	4	32.41
H100	fp16	throughput	2	63.48
H100	fp16	latency	4	64.59
A100	fp16	throughput	2	63.48
A100	fp16	latency	4	64.59
L40S	fp8	throughput	4	32.43
L40S	fp16	throughput	4	64.58
A10G	fp16	throughput	4	64.58
A10G	fp16	latency	8	66.8

Non-optimized configuration#

Any NVIDIA GPU should be, but is not guaranteed to be, able to run this model with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

1	1.0	1.0.0	1.0.1	1.0.3	1.1	1.1.0	1.1.1	1.1.2	1.2	1.2.0	1.2.1	1.2.3
												x