Support Matrix#

Hardware#

CPU#

Requirements:

  • Linux operating system

  • x86 processor with at least 8 cores (modern processor recommended)

  • Memory requirements vary greatly depending on use case

For trtllm_buildable profiles, CPU memory requirements can approach the amount of memory used by the GPUs.

GPU#

NVIDIA NIMs for large language models should run, but are not guaranteed to run, on any NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, as long as the GPUs have CUDA compute capability > 7.0 (8.0 for bfloat16).

You can approximate the amount of required memory using the following guidelines. These guidelines do not apply to trtllm_buildable profiles:

  • 5–10 GB for the OS and other processes

  • 16 GB for Docker (16 GB of shared memory is required by Docker in multi-GPU, non-NVLink cases)

  • Approximately 2 GB of memory per billion model parameters (FP16)

    • Llama 8B: ~ 15 GB

    • Llama 70B: ~ 131 GB

    • Mistral 7B Instruct v0.3: ~ 14 GB

    • Mixtral 8x7B Instruct v0.1: ~ 88 GB

These recommendations are a rough guideline and actual memory required can be lower or higher depending on hardware and NIM configuration.
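As a quick sketch of the arithmetic above (the helper name and its defaults are illustrative, not part of NIM):

```python
def estimate_memory_gb(params_billion: float,
                       bytes_per_param: int = 2,
                       os_gb: float = 10.0,
                       docker_gb: float = 16.0) -> float:
    """Rough memory estimate per the guideline above: about 2 bytes
    per parameter (FP16/BF16) plus OS and Docker overhead."""
    model_gb = params_billion * bytes_per_param  # ~2 GB per billion parameters
    return model_gb + os_gb + docker_gb

# Llama 8B: ~16 GB of weights, ~42 GB including overhead
print(estimate_memory_gb(8))   # 42.0
# Llama 70B: ~140 GB of weights, ~166 GB including overhead
print(estimate_memory_gb(70))  # 166.0
```

As the table above shows, actual weight sizes can differ from this estimate, and quantized (for example, FP8) profiles need roughly half the weight memory.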

Some model/GPU combinations, including vGPU, are optimized. See the following Supported Models section for further information.

Software#

  • Linux operating systems (Ubuntu 20.04 or later recommended)

  • NVIDIA Driver >= 560

  • NVIDIA Docker >= 23.0.1

  • CUDA 12.6.1

If you are running on a data center GPU (for example, an A100), you can use NVIDIA driver release 470.57 (or later R470), 535.86 (or later R535), or 550.54 (or later R550).
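The driver rule above can be sketched as a small check (the function and its name are illustrative, not part of NIM):

```python
def driver_supported(version: str, data_center_gpu: bool = False) -> bool:
    """Check an NVIDIA driver version string against the requirements above:
    >= 560 in general, or, on data center GPUs, R470 >= 470.57,
    R535 >= 535.86, or R550 >= 550.54."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if major >= 560:
        return True
    if not data_center_gpu:
        return False
    branch_minimums = {470: 57, 535: 86, 550: 54}  # R470 / R535 / R550 branches
    return major in branch_minimums and minor >= branch_minimums[major]

print(driver_supported("550.54", data_center_gpu=True))  # True
print(driver_supported("550.54"))                        # False
```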

GPUs#

The GPUs listed in the following sections have the following specifications.

| GPU | Family | Memory |
|---|---|---|
| H200 | SXM/NVLink | 141 GB |
| H100 | SXM/NVLink | 80 GB |
| A100 | SXM/NVLink | 80 GB |
| L40S | PCIe | 48 GB |
| A10G | PCIe | 24 GB |

Supported Models#

The following models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC; they should use the Chat Completions Endpoint. In vGPU environments, the GPU memory values in the following sections refer to the total GPU memory, including the memory reserved for the vGPU setup.

NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. Generic model profiles can be identified by the presence of local_build or vllm in the profile name. On systems where there are no compatible optimized profiles, generic profiles are chosen automatically. Optimized profiles are preferred over generic profiles when available, but you can choose to deploy a generic profile on any system by following the steps at Profile Selection.
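The selection behavior described above can be sketched as follows (the profile names here are hypothetical; real profile IDs come from the NIM container):

```python
def is_generic(profile: str) -> bool:
    """Generic profiles carry 'local_build' or 'vllm' in the profile name."""
    return "local_build" in profile or "vllm" in profile

def choose_profile(compatible: list[str]) -> str:
    """Prefer an optimized profile when one is compatible with the system;
    otherwise fall back to a generic profile."""
    optimized = [p for p in compatible if not is_generic(p)]
    return optimized[0] if optimized else compatible[0]

# Hypothetical profile names for illustration:
profiles = ["vllm-bf16-tp1", "tensorrt_llm-h100-fp8-tp2-throughput"]
print(choose_profile(profiles))  # tensorrt_llm-h100-fp8-tp2-throughput
```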

Code Llama 13B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Throughput | 2 | 24.63 |
| H100 SXM | FP16 | Latency | 4 | 25.32 |
| A100 SXM | FP16 | Throughput | 2 | 24.63 |
| A100 SXM | FP16 | Latency | 4 | 25.31 |
| L40S | FP16 | Throughput | 2 | 25.32 |
| L40S | FP16 | Latency | 2 | 24.63 |
| A10G | FP16 | Throughput | 4 | 25.32 |
| A10G | FP16 | Latency | 8 | 26.69 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

Code Llama 34B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 32.17 |
| H100 SXM | FP8 | Latency | 4 | 32.42 |
| H100 SXM | FP16 | Throughput | 2 | 63.48 |
| H100 SXM | FP16 | Latency | 4 | 64.59 |
| A100 SXM | FP16 | Throughput | 2 | 63.48 |
| A100 SXM | FP16 | Latency | 4 | 64.59 |
| L40S | FP8 | Throughput | 4 | 32.42 |
| L40S | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Latency | 8 | 66.8 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

Code Llama 70B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 4 | 65.47 |
| H100 SXM | FP8 | Latency | 8 | 66.37 |
| H100 SXM | FP16 | Throughput | 4 | 130.35 |
| H100 SXM | FP16 | Latency | 8 | 66.37 |
| A100 SXM | FP16 | Throughput | 4 | 130.35 |
| A100 SXM | FP16 | Latency | 8 | 132.71 |
| A10G | FP16 | Throughput | 8 | 132.69 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

Gemma 2 2B#

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2

Gemma 2 9B#

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2, or 4

(Meta) Llama 2 7B Chat#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 6.57 |
| H100 SXM | FP8 | Latency | 2 | 6.66 |
| H100 SXM | FP16 | Throughput | 1 | 12.62 |
| H100 SXM | FP16 | Throughput LoRA | 1 | 12.63 |
| H100 SXM | FP16 | Latency | 2 | 12.93 |
| A100 SXM | FP16 | Throughput | 1 | 15.54 |
| A100 SXM | FP16 | Throughput LoRA | 1 | 12.63 |
| A100 SXM | FP16 | Latency | 2 | 12.92 |
| L40S | FP8 | Throughput | 1 | 6.57 |
| L40S | FP8 | Latency | 2 | 6.64 |
| L40S | FP16 | Throughput | 1 | 12.64 |
| L40S | FP16 | Throughput LoRA | 1 | 12.65 |
| L40S | FP16 | Latency | 2 | 12.95 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

(Meta) Llama 2 13B Chat#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 2 | 12.6 |
| H100 SXM | FP16 | Throughput | 1 | 24.33 |
| H100 SXM | FP16 | Throughput LoRA | 1 | 24.35 |
| H100 SXM | FP16 | Latency | 2 | 24.71 |
| A100 SXM | FP16 | Throughput | 1 | 24.34 |
| A100 SXM | FP16 | Throughput LoRA | 1 | 24.37 |
| A100 SXM | FP16 | Latency | 2 | 24.74 |
| L40S | FP8 | Throughput | 1 | 12.49 |
| L40S | FP8 | Latency | 2 | 12.59 |
| L40S | FP16 | Throughput | 1 | 24.33 |
| L40S | FP16 | Throughput LoRA | 1 | 24.37 |
| L40S | FP16 | Latency | 2 | 24.7 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

(Meta) Llama 2 70B Chat#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 65.08 |
| H100 SXM | FP8 | Latency | 4 | 65.36 |
| H100 SXM | FP16 | Throughput | 4 | 130.52 |
| H100 SXM | FP16 | Throughput LoRA | 4 | 130.6 |
| H100 SXM | FP16 | Latency | 8 | 133.18 |
| A100 SXM | FP16 | Throughput | 4 | 130.52 |
| A100 SXM | FP16 | Throughput LoRA | 4 | 130.5 |
| A100 SXM | FP16 | Latency | 8 | 133.12 |
| L40S | FP8 | Throughput | 4 | 63.35 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

Llama 3 SQLCoder 8B#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 8.52 |
| H100 SXM | FP8 | Latency | 2 | 8.61 |
| H100 SXM | FP16 | Throughput | 1 | 15 |
| H100 SXM | FP16 | Latency | 2 | 16.02 |
| L40S | FP8 | Throughput | 1 | 8.53 |
| L40S | FP8 | Latency | 2 | 8.61 |
| L40S | FP16 | Throughput | 1 | 15 |
| L40S | FP16 | Latency | 2 | 16.02 |
| A10G | FP16 | Throughput | 1 | 15 |
| A10G | FP16 | Throughput | 2 | 16.02 |
| A10G | FP16 | Latency | 2 | 16.02 |
| A10G | FP16 | Latency | 4 | 18.06 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Llama 3 Swallow 70B Instruct V0.1#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.42 |
| H100 SXM | FP8 | Latency | 4 | 69.3 |
| H100 SXM | FP16 | Throughput | 2 | 137.7 |
| H100 SXM | FP16 | Latency | 4 | 145.94 |
| A100 SXM | FP16 | Throughput | 2 | 137.7 |
| A100 SXM | FP16 | Latency | 2 | 137.7 |
| L40S | FP8 | Throughput | 2 | 68.48 |
| A10G | FP16 | Throughput | 4 | 145.93 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

Llama 3 Taiwan 70B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.42 |
| H100 SXM | FP8 | Latency | 4 | 145.94 |
| H100 SXM | FP16 | Throughput | 2 | 137.7 |
| H100 SXM | FP16 | Latency | 4 | 137.7 |
| A100 SXM | FP16 | Throughput | 2 | 137.7 |
| A100 SXM | FP16 | Latency | 2 | 145.94 |
| L40S | FP8 | Throughput | 2 | 68.48 |
| A10G | FP16 | Throughput | 4 | 145.93 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

Llama 3.1 8B Base#

Optimized configurations#

The Profile column indicates what the model is optimized for.

| GPU | Precision | Profile | # of GPUs |
|---|---|---|---|
| H100 SXM | BF16 | Latency | 2 |
| H100 SXM | FP8 | Latency | 2 |
| H100 SXM | BF16 | Throughput | 1 |
| H100 SXM | FP8 | Throughput | 1 |
| A100 SXM | BF16 | Latency | 2 |
| A100 SXM | BF16 | Throughput | 1 |
| L40S | BF16 | Latency | 2 |
| L40S | BF16 | Throughput | 2 |
| A10G | BF16 | Latency | 4 |
| A10G | BF16 | Throughput | 2 |

Generic configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or set of homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory | 24 | FP16 | 15 |

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x | x |  |  |  |  |  |  |  |  |  |  |

Llama 3.1 8B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 8.56 |
| H100 SXM | FP8 | Latency | 2 | 8.66 |
| H100 SXM | BF16 | Throughput | 1 | 15.06 |
| H100 SXM | BF16 | Latency | 2 | 16.15 |
| H100 NVL | FP8 | Throughput | 1 | 15.06 |
| H100 NVL | FP8 | Latency | 2 | 8.74 |
| H100 NVL | BF16 | Throughput | 1 | 8.57 |
| H100 NVL | BF16 | Latency | 2 | 16.15 |
| A100 SXM | BF16 | Throughput | 1 | 15.06 |
| A100 SXM | BF16 | Latency | 2 | 16.15 |
| L40S | BF16 | Throughput | 1 | 15.5 |
| L40S | BF16 | Throughput | 2 | 16.15 |
| L40S | BF16 | Latency | 4 | 18.31 |
| A10G | FP16 | Throughput | 2 | 16.35 |
| A10G | BF16 | Latency | 4 | 18.71 |

Generic configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or set of homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory | 24 | FP16 | 15 |

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x | x | x | x | x |  |  |  |  |  |  |  |

Llama 3.1 70B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 SXM | FP8 | Throughput | 1 | 67.87 |
| H200 SXM | FP8 | Latency | 2 | 68.2 |
| H200 SXM | BF16 | Throughput | 2 | 133.72 |
| H200 SXM | BF16 | Latency | 4 | 137.99 |
| H100 SXM | FP8 | Throughput | 2 | 68.2 |
| H100 SXM | FP8 | Throughput | 4 | 68.72 |
| H100 SXM | FP8 | Latency | 8 | 69.71 |
| H100 SXM | BF16 | Throughput | 4 | 138.39 |
| H100 SXM | BF16 | Latency | 8 | 147.66 |
| H100 NVL | FP8 | Throughput | 2 | 68.2 |
| H100 NVL | FP8 | Latency | 4 | 68.72 |
| H100 NVL | BF16 | Throughput | 2 | 133.95 |
| H100 NVL | BF16 | Throughput | 4 | 138.4 |
| H100 NVL | BF16 | Latency | 8 | 147.37 |
| A100 SXM | BF16 | Throughput | 4 | 138.53 |
| A100 SXM | BF16 | Latency | 8 | 147.44 |
| L40S | BF16 | Throughput | 4 | 138.49 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x | x | x | x | x |  |  |  |  |  |  |  |

Llama 3.1 405B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 8 | 388.75 |
| H100 SXM | FP16 | Latency | 16 | 794.9 |
| A100 SXM | FP16 | Latency | 16 | 798.2 |

Generic configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or set of homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory | 240 | FP16 | 100 |

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x | x | x | x |  |  |  |  |  |  |  |  |

Llama 3.1 Nemotron 70B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.18 |
| H100 SXM | FP8 | Throughput | 4 | 68.64 |
| H100 SXM | FP8 | Latency | 8 | 69.77 |
| H100 SXM | FP16 | Throughput | 4 | 137.94 |
| H100 SXM | FP16 | Latency | 8 | 146.41 |
| A100 SXM | FP16 | Throughput | 4 | 137.93 |
| A100 SXM | FP16 | Latency | 8 | 146.41 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x | x | x | x | x |  |  |  |  |  |  |  |

Llama 3.1 Swallow 8B Instruct v0.1#

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2, 4

Llama 3.1 Swallow 70B Instruct v0.1#

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 2, 4, or 8

Meta Llama 3 8B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Throughput | 1 | 28 |
| H100 SXM | FP16 | Latency | 2 | 28 |
| A100 SXM | FP16 | Throughput | 1 | 28 |
| A100 SXM | FP16 | Latency | 2 | 28 |
| L40S | FP8 | Throughput | 1 | 20.5 |
| L40S | FP8 | Latency | 2 | 20.5 |
| L40S | FP16 | Throughput | 1 | 28 |
| A10G | FP16 | Throughput | 1 | 28 |
| A10G | FP16 | Latency | 2 | 28 |

Generic configuration#

The Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or set of homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory | 24 | FP16 | 16 |

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x |  |  |  |  |  |  |  |  |  |  |  |

Meta Llama 3 70B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 4 | 82 |
| H100 SXM | FP8 | Latency | 8 | 82 |
| H100 SXM | FP16 | Throughput | 4 | 158 |
| H100 SXM | FP16 | Latency | 8 | 158 |
| A100 SXM | FP16 | Throughput | 4 | 158 |

Generic configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or set of homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory | 240 | FP16 | 100 |

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x | x |  |  |  |  |  |  |  |  |  |  |

Mistral 7B Instruct V0.3#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 7.08 |
| H100 SXM | FP8 | Latency | 2 | 7.19 |
| H100 SXM | BF16 | Throughput | 1 | 13.56 |
| H100 SXM | BF16 | Latency | 2 | 7.19 |
| A100 SXM | BF16 | Throughput | 1 | 13.56 |
| A100 SXM | BF16 | Latency | 2 | 13.87 |
| L40S | FP8 | Throughput | 1 | 7.08 |
| L40S | FP8 | Latency | 2 | 7.16 |
| L40S | BF16 | Throughput | 1 | 13.55 |
| L40S | BF16 | Latency | 2 | 13.85 |
| A10G | BF16 | Throughput | 2 | 13.87 |
| A10G | BF16 | Latency | 4 | 14.48 |

Generic configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or set of homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory | 24 | FP16 | 16 |

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x |  |  |  |  |  |  |  |  |  |  |  |

Mistral NeMo Minitron 8B 8K Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 8.91 |
| H100 SXM | FP8 | Latency | 2 | 9.03 |
| H100 SXM | FP16 | Throughput | 1 | 15.72 |
| H100 SXM | FP16 | Latency | 2 | 16.78 |
| A100 SXM | FP16 | Throughput | 1 | 15.72 |
| A100 SXM | FP16 | Latency | 2 | 16.78 |
| L40S | FP8 | Throughput | 1 | 8.92 |
| L40S | FP8 | Latency | 2 | 9.02 |
| L40S | FP16 | Throughput | 1 | 15.72 |
| L40S | FP16 | Latency | 2 | 16.77 |
| A10G | FP16 | Throughput | 2 | 16.81 |
| A10G | FP16 | Latency | 4 | 15.72 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Mistral NeMo 12B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 2 | 13.82 |
| H100 SXM | FP16 | Throughput | 1 | 23.35 |
| H100 SXM | FP16 | Latency | 2 | 25.14 |
| A100 SXM | FP16 | Throughput | 1 | 23.35 |
| A100 SXM | FP16 | Latency | 2 | 25.14 |
| L40S | FP8 | Throughput | 2 | 13.83 |
| L40S | FP8 | Latency | 4 | 15.01 |
| L40S | FP16 | Throughput | 2 | 25.14 |
| L40S | FP16 | Latency | 4 | 28.71 |
| A10G | FP16 | Throughput | 4 | 28.71 |
| A10G | FP16 | Latency | 8 | 35.87 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

Mixtral 8x7B Instruct V0.1#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 100 |
| H100 SXM | FP8 | Latency | 4 | 100 |
| H100 SXM | INT8WO | Throughput | 2 | 100 |
| H100 SXM | INT8WO | Latency | 4 | 100 |
| H100 SXM | FP16 | Throughput | 2 | 100 |
| H100 SXM | FP16 | Latency | 4 | 100 |
| A100 SXM | FP16 | Throughput | 2 | 100 |
| A100 SXM | FP16 | Latency | 4 | 100 |
| L40S | FP8 | Throughput | 4 | 100 |
| L40S | FP16 | Throughput | 4 | 100 |
| A10G | FP16 | Throughput | 8 | 100 |

Generic configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or set of homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory | 24 | FP16 | 16 |

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x | x |  |  |  |  |  |  |  |  |  |  |

Mixtral 8x22B Instruct V0.1#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 8 | 132.61 |
| H100 SXM | FP8 | Latency | 8 | 132.56 |
| H100 SXM | INT8WO | Throughput | 8 | 134.82 |
| H100 SXM | INT8WO | Latency | 8 | 132.31 |
| H100 SXM | FP16 | Throughput | 8 | 265.59 |
| A100 SXM | FP16 | Throughput | 8 | 265.7 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x |  |  |  |  |  |  |  |  |  |  |  |

Nemotron 4 340B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Latency | 16 | 636.45 |
| A100 SXM | FP16 | Latency | 16 | 636.45 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

Nemotron 4 340B Instruct 128K#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | BF16 | Latency | 16 | 637.26 |
| A100 SXM | BF16 | Latency | 16 | 637.22 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x |  |  |  |  |  |  |  |  |  |  |  |  |  |

Nemotron 4 340B Reward#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Latency | 16 | 636.45 |
| A100 SXM | FP16 | Latency | 16 | 636.45 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x |  |  |  |  |  |  |  |  |  |  |  |  |

Phi 3 Mini 4K Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 3.8 |
| H100 SXM | FP16 | Throughput | 1 | 7.14 |
| A100 SXM | FP16 | Throughput | 1 | 7.14 |
| L40S | FP8 | Throughput | 1 | 3.8 |
| L40S | FP16 | Throughput | 1 | 7.14 |
| A10G | FP16 | Throughput | 1 | 7.14 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Phind Codellama 34B V2 Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. The Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 32.17 |
| H100 SXM | FP8 | Latency | 4 | 32.41 |
| H100 SXM | FP16 | Throughput | 2 | 63.48 |
| H100 SXM | FP16 | Latency | 4 | 64.59 |
| A100 SXM | FP16 | Throughput | 2 | 63.48 |
| A100 SXM | FP16 | Latency | 4 | 64.59 |
| L40S | FP8 | Throughput | 4 | 32.43 |
| L40S | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Latency | 8 | 66.8 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported Releases#

This model is supported in the following releases.

| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x |  |  |  |  |  |  |  |  |  |  |  |  |  |  |