Supported Models#

GPUs#

The GPUs listed in the following sections have the following specifications.

| GPU | Family | Memory |
|---|---|---|
| H200 | SXM/NVLink | 141 GB |
| H100 | SXM/NVLink | 80 GB |
| A100 | SXM/NVLink | 80 GB |
| L40S | PCIe | 48 GB |
| A10G | PCIe | 24 GB |
| NVIDIA RTX 6000 Ada Generation | | 32 GB |
| GeForce RTX 5090 | | 32 GB |
| GeForce RTX 5080 | | 16 GB |
| GeForce RTX 4090 | | 24 GB |
| GeForce RTX 4080 | | 16 GB |

Optimized Models#

The following models are optimized with TRT-LLM and are available as pre-built, optimized engines on NGC; use them with the Chat Completions endpoint. In vGPU environments, the GPU memory values in the following sections refer to the total GPU memory, including the memory reserved for the vGPU setup.

NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. Generic model profiles can be identified by the presence of local_build or vllm in the profile name. On systems where there are no compatible optimized profiles, generic profiles are chosen automatically. Optimized profiles are preferred over generic profiles when available, but you can choose to deploy a generic profile on any system by following the steps at Profile Selection.
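As described above, a generic profile can be recognized purely by its name. A minimal sketch of that convention follows; the profile names in the example are hypothetical, for illustration only:

```python
def is_generic_profile(profile_name: str) -> bool:
    """A profile is generic (non-optimized) when its name contains
    'local_build' or 'vllm', per the naming convention described above."""
    return "local_build" in profile_name or "vllm" in profile_name

# Hypothetical profile names, for illustration only.
profiles = [
    "tensorrt_llm-h100-fp8-tp2-throughput",
    "vllm-bf16-tp1",
    "local_build-bf16-tp2",
]

# Keep only the generic profiles.
generic = [p for p in profiles if is_generic_profile(p)]
```

On a system with no compatible optimized profile, only names like the last two would remain candidates.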

You can also find additional information in Models about the features, such as LoRA, that these models support.
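Since the optimized engines are served through an OpenAI-compatible Chat Completions endpoint, a request can be sketched as follows; the base URL, port, and model identifier are assumptions for illustration, not values from this document:

```python
import json
from urllib import request

# Hypothetical local endpoint and model name; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta/llama-3.1-8b-instruct",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    "max_tokens": 64,
}

def build_request(url: str, body: dict) -> request.Request:
    """Build a POST request carrying the chat-completions payload as JSON."""
    data = json.dumps(body).encode("utf-8")
    return request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )

req = build_request(BASE_URL, payload)
# Sending it would be: request.urlopen(req) against a running server.
```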

Code Llama 13B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Throughput | 2 | 24.63 |
| H100 SXM | FP16 | Latency | 4 | 25.32 |
| A100 SXM | FP16 | Throughput | 2 | 24.63 |
| A100 SXM | FP16 | Latency | 4 | 25.31 |
| L40S | FP16 | Throughput | 2 | 25.32 |
| L40S | FP16 | Latency | 2 | 24.63 |
| A10G | FP16 | Throughput | 4 | 25.32 |
| A10G | FP16 | Latency | 8 | 26.69 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
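The generic-configuration criteria above can be expressed as a small eligibility check. This is an illustrative sketch only; the function and its input fields are assumptions, not part of any NVIDIA API:

```python
def meets_generic_config(
    gpus: list,
    required_memory_gb: float,
    precision: str = "fp16",
) -> bool:
    """Sketch of the generic-profile rules described above.

    Each GPU dict is assumed to carry: name, memory_gb, free_fraction,
    and compute_capability. All GPUs must be the same model (homogeneous).
    """
    if not gpus:
        return False
    # GPUs must be homogeneous (all the same model).
    if len({g["name"] for g in gpus}) != 1:
        return False
    # Compute capability >= 7.0, or >= 8.0 when running bfloat16.
    min_cc = 8.0 if precision == "bf16" else 7.0
    if any(g["compute_capability"] < min_cc for g in gpus):
        return False
    # Aggregate memory must cover the model.
    if sum(g["memory_gb"] for g in gpus) < required_memory_gb:
        return False
    # At least one GPU must have 95% or more of its memory free.
    return any(g["free_fraction"] >= 0.95 for g in gpus)
```

For example, two hypothetical 24 GB GPUs with compute capability 8.6 would pass for a 40 GB model in BF16, while a compute-capability-7.0 GPU would pass only for non-bfloat16 precisions.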

Code Llama 34B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 32.17 |
| H100 SXM | FP8 | Latency | 4 | 32.42 |
| H100 SXM | FP16 | Throughput | 2 | 63.48 |
| H100 SXM | FP16 | Latency | 4 | 64.59 |
| A100 SXM | FP16 | Throughput | 2 | 63.48 |
| A100 SXM | FP16 | Latency | 4 | 64.59 |
| L40S | FP8 | Throughput | 4 | 32.42 |
| L40S | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Latency | 8 | 66.8 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Code Llama 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 4 | 65.47 |
| H100 SXM | FP8 | Latency | 8 | 66.37 |
| H100 SXM | FP16 | Throughput | 4 | 130.35 |
| H100 SXM | FP16 | Latency | 8 | 66.37 |
| A100 SXM | FP16 | Throughput | 4 | 130.35 |
| A100 SXM | FP16 | Latency | 8 | 132.71 |
| A10G | FP16 | Throughput | 8 | 132.69 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

DeepSeek R1#

Supported Configurations#

The following configurations support this model:

  • 8 x H200 (1)

  • 2 nodes of [8 x H100], for 16 H100 GPUs total

Refer to the NGC catalog entry for further information.

DeepSeek R1 Distill Llama 8B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 SXM | FP8 | Throughput | 1 | 8.58 |
| H200 SXM | FP8 | Latency | 2 | 8.72 |
| H200 SXM | BF16 | Throughput | 1 | 15.05 |
| H200 SXM | BF16 | Latency | 2 | 16.12 |
| H100 SXM | FP8 | Throughput | 1 | 8.58 |
| H100 SXM | FP8 | Latency | 2 | 8.74 |
| H100 SXM | BF16 | Throughput | 1 | 15.05 |
| H100 SXM | BF16 | Latency | 2 | 16.12 |
| H100 NVL | FP8 | Throughput | 1 | 8.58 |
| H100 NVL | FP8 | Latency | 2 | 8.73 |
| H100 NVL | BF16 | Latency | 2 | 16.12 |
| H100 NVL | BF16 | Throughput | 1 | 15.0 |
| A100 SXM | BF16 | Throughput | 1 | 15.16 |
| A100 SXM | BF16 | Latency | 2 | 16.36 |
| L40S | FP8 | Throughput | 1 | 8.58 |
| L40S | FP8 | Latency | 2 | 8.71 |
| L40S | BF16 | Throughput | 1 | 15.14 |
| L40S | BF16 | Latency | 2 | 16.32 |
| A10G | BF16 | Throughput | 2 | 16.12 |
| A10G | BF16 | Latency | 4 | 18.25 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

DeepSeek R1 Distill Llama 70B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 | FP8 | Latency | 4 | 68.66 |
| H200 | FP8 | Throughput | 2 | 68.12 |
| H200 | BF16 | Latency | 8 | 146.18 |
| H200 | BF16 | Throughput | 4 | 137.77 |
| H100 | FP8 | Latency | 4 | 68.65 |
| H100 | FP8 | Throughput | 2 | 68.18 |
| H100 | FP8 | Latency | 8 | 69.6 |
| H100 | BF16 | Latency | 8 | 146.18 |
| H100 | BF16 | Throughput | 4 | 137.77 |
| A100 | BF16 | Latency | 8 | 146.19 |
| A100 | BF16 | Throughput | 4 | 137.82 |
| L40S | FP8 | Throughput | 4 | 68.57 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported Releases#

  • 1.5

DeepSeek R1 Distill Llama 8B RTX#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 5.42 |

DeepSeek-R1-Distill-Qwen-32B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 | BF16 | Throughput | 1 | 61.19 |
| H100 | BF16 | Throughput | 1 | 61.19 |
| H200 | BF16 | Throughput | 2 | 62.77 |
| H20 | BF16 | Throughput | 1 | 61.19 |
| A100 | BF16 | Throughput | 1 | 61.18 |
| L40S | BF16 | Throughput | 2 | 62.79 |
| L20 | BF16 | Throughput | 2 | 62.8 |
| L40S | FP8 | Throughput | 2 | 32.49 |
| H200 | FP8 | Throughput | 1 | 32.15 |
| H200 | FP8 | Throughput | 2 | 32.45 |
| H100 | FP8 | Throughput | 1 | 32.14 |
| H20 | FP8 | Throughput | 1 | 32.12 |
| L20 | FP8 | Throughput | 1 | 32.16 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported Releases#

  • 1.8

Additional Information#

| Organization | Catalog Page | LoRA Support | Tool Calling Support | Parallel Tool Calling Support |
|---|---|---|---|---|
| DeepSeek | DeepSeek-R1-Distill-Qwen-32B | No | No | No |

DeepSeek-R1-Distill-Qwen-7B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 | BF16 | Throughput | 1 | 21.93 |
| H200 | FP8 | Throughput | 1 | 15.85 |
| H100 | BF16 | Throughput | 1 | 21.94 |
| H100 | FP8 | Throughput | 1 | 15.84 |
| H20 | BF16 | Throughput | 1 | 22.00 |
| H20 | FP8 | Throughput | 1 | 15.83 |
| L20 | BF16 | Throughput | 1 | 21.97 |
| L20 | FP8 | Throughput | 1 | 15.84 |
| A100 | BF16 | Throughput | 1 | 21.98 |
| A10G | BF16 | Throughput | 1 | 21.97 |

Supported Releases#

  • 1.5

Additional Information#

| Organization | Catalog Page | LoRA Support | Tool Calling Support | Parallel Tool Calling Support |
|---|---|---|---|---|
| DeepSeek | DeepSeek-R1-Distill-Qwen-7B | No | No | No |

DeepSeek-R1-Distill-Qwen-14B#

Optimized configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H20 | FP8 | Throughput | 1 | 22.52 |
| H20 | BF16 | Throughput | 1 | 34.98 |
| L20 | FP8 | Throughput | 1 | 22.54 |
| L20 | BF16 | Throughput | 1 | 34.96 |
| H100 | FP8 | Throughput | 1 | 22.54 |
| H200 | FP8 | Throughput | 1 | 22.54 |
| H200 | BF16 | Throughput | 1 | 34.87 |
| L40S | FP8 | Throughput | 1 | 22.55 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported Releases#

This model is supported in the following releases:

  • 1

  • 1.0

  • 1.0.0

  • 1.0.1

  • 1.0.3

  • 1.1

  • 1.1.0

  • 1.1.1

  • 1.1.2

  • 1.2

  • 1.2.0

  • 1.2.1

  • 1.2.3

  • 1.3

  • 1.4

  • 1.5

  • 1.6

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1

Qwen2.5 72B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H20 | FP8 | Throughput | 4 | 77.71 |
| H20 | FP8 | Throughput | 8 | 77.96 |
| H20 | FP8 | Latency | 4 | 78.22 |
| H20 | FP8 | Latency | 8 | 78.98 |
| L20 | FP8 | Throughput | 4 | 78.14 |
| L20 | FP8 | Throughput | 8 | 79.15 |
| L20 | FP8 | Latency | 4 | 78.14 |
| L20 | FP8 | Latency | 8 | 78.89 |
| A100 SXM | BF16 | Throughput | 4 | 150.35 |
| A100 SXM | BF16 | Latency | 8 | 160.18 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported Releases#

This model is supported in the following releases:

  • 1

  • 1.0

  • 1.0.0

  • 1.0.1

  • 1.0.3

  • 1.1

  • 1.1.0

  • 1.1.1

  • 1.1.2

  • 1.2

  • 1.2.0

  • 1.2.1

  • 1.2.3

  • 1.3

  • 1.4

Qwen2.5 7B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| L20 | FP16 | Throughput | 1 | 21.66 |
| A100 PCIe 40GB | FP16 | Latency | 1 | 21.66 |
| A100 PCIe 40GB | BF16 | Throughput | 1 | 21.66 |
| A100 PCIe 40GB | FP16 | Balanced | 1 | 21.66 |
| A100 SXM/NVLink | FP16 | Latency | 1 | 21.66 |
| A100 SXM/NVLink | BF16 | Throughput | 1 | 21.66 |
| A100 SXM/NVLink | BF16 | Balanced | 1 | 21.66 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported Releases#

This model is supported in the following releases:

  • 1

  • 1.0

  • 1.0.0

  • 1.0.1

  • 1.0.3

  • 1.1

  • 1.1.0

  • 1.1.1

  • 1.1.2

  • 1.2

  • 1.2.0

  • 1.2.1

  • 1.2.3

  • 1.3

  • 1.4

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16, FP16

  • # of GPUs: 1

Gemma 2 2B#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2

Gemma 2 9B#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2, or 4

(Meta) Llama 2 7B Chat#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 6.57 |
| H100 SXM | FP8 | Latency | 2 | 6.66 |
| H100 SXM | FP16 | Throughput | 1 | 12.62 |
| H100 SXM | FP16 | Throughput LoRA | 1 | 12.63 |
| H100 SXM | FP16 | Latency | 2 | 12.93 |
| A100 SXM | FP16 | Throughput | 1 | 15.54 |
| A100 SXM | FP16 | Throughput LoRA | 1 | 12.63 |
| A100 SXM | FP16 | Latency | 2 | 12.92 |
| L40S | FP8 | Throughput | 1 | 6.57 |
| L40S | FP8 | Latency | 2 | 6.64 |
| L40S | FP16 | Throughput | 1 | 12.64 |
| L40S | FP16 | Throughput LoRA | 1 | 12.65 |
| L40S | FP16 | Latency | 2 | 12.95 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

(Meta) Llama 2 13B Chat#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 2 | 12.6 |
| H100 SXM | FP16 | Throughput | 1 | 24.33 |
| H100 SXM | FP16 | Throughput LoRA | 1 | 24.35 |
| H100 SXM | FP16 | Latency | 2 | 24.71 |
| A100 SXM | FP16 | Throughput | 1 | 24.34 |
| A100 SXM | FP16 | Throughput LoRA | 1 | 24.37 |
| A100 SXM | FP16 | Latency | 2 | 24.74 |
| L40S | FP8 | Throughput | 1 | 12.49 |
| L40S | FP8 | Latency | 2 | 12.59 |
| L40S | FP16 | Throughput | 1 | 24.33 |
| L40S | FP16 | Throughput LoRA | 1 | 24.37 |
| L40S | FP16 | Latency | 2 | 24.7 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

(Meta) Llama 2 70B Chat#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 65.08 |
| H100 SXM | FP8 | Latency | 4 | 65.36 |
| H100 SXM | FP16 | Throughput | 4 | 130.52 |
| H100 SXM | FP16 | Throughput LoRA | 4 | 130.6 |
| H100 SXM | FP16 | Latency | 8 | 133.18 |
| A100 SXM | FP16 | Throughput | 4 | 130.52 |
| A100 SXM | FP16 | Throughput LoRA | 4 | 130.5 |
| A100 SXM | FP16 | Latency | 8 | 133.12 |
| L40S | FP8 | Throughput | 4 | 63.35 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3 SQLCoder 8B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 8.52 |
| H100 SXM | FP8 | Latency | 2 | 8.61 |
| H100 SXM | FP16 | Throughput | 1 | 15 |
| H100 SXM | FP16 | Latency | 2 | 16.02 |
| L40S | FP8 | Throughput | 1 | 8.53 |
| L40S | FP8 | Latency | 2 | 8.61 |
| L40S | FP16 | Throughput | 1 | 15 |
| L40S | FP16 | Latency | 2 | 16.02 |
| A10G | FP16 | Throughput | 1 | 15 |
| A10G | FP16 | Throughput | 2 | 16.02 |
| A10G | FP16 | Latency | 2 | 16.02 |
| A10G | FP16 | Latency | 4 | 18.06 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3 Swallow 70B Instruct V0.1#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.42 |
| H100 SXM | FP8 | Latency | 4 | 69.3 |
| H100 SXM | FP16 | Throughput | 2 | 137.7 |
| H100 SXM | FP16 | Latency | 4 | 145.94 |
| A100 SXM | FP16 | Throughput | 2 | 137.7 |
| A100 SXM | FP16 | Latency | 2 | 137.7 |
| L40S | FP8 | Throughput | 2 | 68.48 |
| A10G | FP16 | Throughput | 4 | 145.93 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3 Taiwan 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.42 |
| H100 SXM | FP8 | Latency | 4 | 145.94 |
| H100 SXM | FP16 | Throughput | 2 | 137.7 |
| H100 SXM | FP16 | Latency | 4 | 137.7 |
| A100 SXM | FP16 | Throughput | 2 | 137.7 |
| A100 SXM | FP16 | Latency | 2 | 145.94 |
| L40S | FP8 | Throughput | 2 | 68.48 |
| A10G | FP16 | Throughput | 4 | 145.93 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3.1 8B Base#

Optimized Configurations#

The Profile column indicates what the model is optimized for.

| GPU | Precision | Profile | # of GPUs |
|---|---|---|---|
| H100 SXM | BF16 | Latency | 2 |
| H100 SXM | FP8 | Latency | 2 |
| H100 SXM | BF16 | Throughput | 1 |
| H100 SXM | FP8 | Throughput | 1 |
| A100 SXM | BF16 | Latency | 2 |
| A100 SXM | BF16 | Throughput | 1 |
| L40S | BF16 | Latency | 2 |
| L40S | BF16 | Throughput | 2 |
| A10G | BF16 | Latency | 4 |
| A10G | BF16 | Throughput | 2 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 15 |

Llama 3.1 8B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for.

| GPU | Profile | # of GPUs |
|---|---|---|
| H100 SXM | Throughput | 1 |
| H100 SXM | Latency | 2 |
| H100 NVL | Throughput | 1 |
| H100 NVL | Latency | 2 |
| A100 SXM | Throughput | 1 |
| A100 SXM | Latency | 2 |
| L40S | Throughput | 2 |
| L40S | Latency | 4 |
| A10G | Throughput | 2 |
| A10G | Latency | 4 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 15 |

Llama 3.1 8B Instruct RTX#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 5.41 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 5.42 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 15 |

Llama 3.2 1B Instruct#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: One H100, A100, L40S, or A10G

Llama 3.2 3B Instruct#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: One H100, A100, or L40S

Llama 3.1 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 SXM | FP8 | Throughput | 1 | 67.87 |
| H200 SXM | FP8 | Latency | 2 | 68.2 |
| H200 SXM | BF16 | Throughput | 2 | 133.72 |
| H200 SXM | BF16 | Latency | 4 | 137.99 |
| H100 SXM | FP8 | Throughput | 2 | 68.2 |
| H100 SXM | FP8 | Throughput | 4 | 68.72 |
| H100 SXM | FP8 | Latency | 8 | 69.71 |
| H100 SXM | BF16 | Throughput | 4 | 138.39 |
| H100 SXM | BF16 | Latency | 8 | 147.66 |
| H100 NVL | FP8 | Throughput | 2 | 68.2 |
| H100 NVL | FP8 | Latency | 4 | 68.72 |
| H100 NVL | BF16 | Throughput | 2 | 133.95 |
| H100 NVL | BF16 | Throughput | 4 | 138.4 |
| H100 NVL | BF16 | Latency | 8 | 147.37 |
| A100 SXM | BF16 | Throughput | 4 | 138.53 |
| A100 SXM | BF16 | Latency | 8 | 147.44 |
| L40S | BF16 | Throughput | 4 | 138.49 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3.1 405B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 8 | 388.75 |
| H100 SXM | FP16 | Latency | 16 | 794.9 |
| A100 SXM | FP16 | Latency | 16 | 798.2 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 240 | FP16 | 100 |

Llama 3.1 Nemotron 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.18 |
| H100 SXM | FP8 | Throughput | 4 | 68.64 |
| H100 SXM | FP8 | Latency | 8 | 69.77 |
| H100 SXM | FP16 | Throughput | 4 | 137.94 |
| H100 SXM | FP16 | Latency | 8 | 146.41 |
| A100 SXM | FP16 | Throughput | 4 | 137.93 |
| A100 SXM | FP16 | Latency | 8 | 146.41 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3.1 Swallow 8B Instruct v0.1#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2, 4

Llama 3.1 Swallow 70B Instruct v0.1#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 2, 4, 8

Llama 3.3 70B Instruct#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 4, 8

Meta Llama 3 8B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Throughput | 1 | 28 |
| H100 SXM | FP16 | Latency | 2 | 28 |
| A100 SXM | FP16 | Throughput | 1 | 28 |
| A100 SXM | FP16 | Latency | 2 | 28 |
| L40S | FP8 | Throughput | 1 | 20.5 |
| L40S | FP8 | Latency | 2 | 20.5 |
| L40S | FP16 | Throughput | 1 | 28 |
| A10G | FP16 | Throughput | 1 | 28 |
| A10G | FP16 | Latency | 2 | 28 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 16 |

Llama 3.3 Nemotron Super 49B V1#

| GPU | Precision | # of GPUs |
|---|---|---|
| H200 SXM | BF16 | 2 |
| H100 SXM | BF16 | 2 |
| H100 NVL | BF16 | 2 |
| A100 SXM | BF16 | 2 |
| L40S | BF16 | 4 |
| A10G | BF16 | 8 |

Meta Llama 3 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 4 | 82 |
| H100 SXM | FP8 | Latency | 8 | 82 |
| H100 SXM | FP16 | Throughput | 4 | 158 |
| H100 SXM | FP16 | Latency | 8 | 158 |
| A100 SXM | FP16 | Throughput | 4 | 158 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 240 | FP16 | 100 |

Mistral 7B Instruct V0.3#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 7.08 |
| H100 SXM | FP8 | Latency | 2 | 7.19 |
| H100 SXM | BF16 | Throughput | 1 | 13.56 |
| H100 SXM | BF16 | Latency | 2 | 7.19 |
| A100 SXM | BF16 | Throughput | 1 | 13.56 |
| A100 SXM | BF16 | Latency | 2 | 13.87 |
| L40S | FP8 | Throughput | 1 | 7.08 |
| L40S | FP8 | Latency | 2 | 7.16 |
| L40S | BF16 | Throughput | 1 | 13.55 |
| L40S | BF16 | Latency | 2 | 13.85 |
| A10G | BF16 | Throughput | 2 | 13.87 |
| A10G | BF16 | Latency | 4 | 14.48 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 16 |

Mistral NeMo Minitron 8B 8K Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 8.91 |
| H100 SXM | FP8 | Latency | 2 | 9.03 |
| H100 SXM | FP16 | Throughput | 1 | 15.72 |
| H100 SXM | FP16 | Latency | 2 | 16.78 |
| A100 SXM | FP16 | Throughput | 1 | 15.72 |
| A100 SXM | FP16 | Latency | 2 | 16.78 |
| L40S | FP8 | Throughput | 1 | 8.92 |
| L40S | FP8 | Latency | 2 | 9.02 |
| L40S | FP16 | Throughput | 1 | 15.72 |
| L40S | FP16 | Latency | 2 | 16.77 |
| A10G | FP16 | Throughput | 2 | 16.81 |
| A10G | FP16 | Latency | 4 | 15.72 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Mistral NeMo 12B Instruct RTX#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 31 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Mistral NeMo 12B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 2 | 13.82 |
| H100 SXM | FP16 | Throughput | 1 | 23.35 |
| H100 SXM | FP16 | Latency | 2 | 25.14 |
| A100 SXM | FP16 | Throughput | 1 | 23.35 |
| A100 SXM | FP16 | Latency | 2 | 25.14 |
| L40S | FP8 | Throughput | 2 | 13.83 |
| L40S | FP8 | Latency | 4 | 15.01 |
| L40S | FP16 | Throughput | 2 | 25.14 |
| L40S | FP16 | Latency | 4 | 28.71 |
| A10G | FP16 | Throughput | 4 | 28.71 |
| A10G | FP16 | Latency | 8 | 35.87 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Mixtral 8x7B Instruct V0.1#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 4 | 100 |
| H100 SXM | INT8WO | Throughput | 2 | 100 |
| H100 SXM | INT8WO | Latency | 4 | 100 |
| H100 SXM | FP16 | Throughput | 2 | 100 |
| H100 SXM | FP16 | Latency | 4 | 100 |
| A100 SXM | FP16 | Throughput | 2 | 100 |
| A100 SXM | FP16 | Latency | 4 | 100 |
| L40S | FP8 | Throughput | 4 | 100 |
| L40S | FP16 | Throughput | 4 | 100 |
| A10G | FP16 | Throughput | 8 | 100 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 16 |

Mixtral 8x22B Instruct V0.1#

Optimized Configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 8 | 132.61 |
| H100 SXM | FP8 | Latency | 8 | 132.56 |
| H100 SXM | INT8WO | Throughput | 8 | 134.82 |
| H100 SXM | INT8WO | Latency | 8 | 132.31 |
| H100 SXM | FP16 | Throughput | 8 | 265.59 |
| A100 SXM | FP16 | Throughput | 8 | 265.7 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

StarCoder2 7B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 | BF16 | Throughput | 1 | 13.89 |
| H100 | BF16 | Latency | 2 | 14.44 |
| H100 | FP8 | Throughput | 1 | 7.56 |
| H100 | FP8 | Latency | 2 | 7.41 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Nemotron 4 340B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Latency | 16 | 636.45 |
| A100 SXM | FP16 | Latency | 16 | 636.45 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Nemotron 4 340B Instruct 128K#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | BF16 | Latency | 16 | 637.26 |
| A100 SXM | BF16 | Latency | 16 | 637.22 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Nemotron 4 340B Reward#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Latency | 16 | 636.45 |
| A100 SXM | FP16 | Latency | 16 | 636.45 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Phi 3 Mini 4K Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 3.8 |
| H100 SXM | FP16 | Throughput | 1 | 7.14 |
| A100 SXM | FP16 | Throughput | 1 | 7.14 |
| L40S | FP8 | Throughput | 1 | 3.8 |
| L40S | FP16 | Throughput | 1 | 7.14 |
| A10G | FP16 | Throughput | 1 | 7.14 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Phind Codellama 34B V2 Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 32.17 |
| H100 SXM | FP8 | Latency | 4 | 32.41 |
| H100 SXM | FP16 | Throughput | 2 | 63.48 |
| H100 SXM | FP16 | Latency | 4 | 64.59 |
| A100 SXM | FP16 | Throughput | 2 | 63.48 |
| A100 SXM | FP16 | Latency | 4 | 64.59 |
| L40S | FP8 | Throughput | 4 | 32.43 |
| L40S | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Latency | 8 | 66.8 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

StarCoderBase 15.5B#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported TRT-LLM buildable profiles#

  • Precision: FP32

  • # of GPUs: 2, 4, 8
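Given the supported GPU counts above, a deployment script might select the largest buildable configuration that fits the available hardware. The sketch below is illustrative only; `pick_gpu_count` is a hypothetical helper, not a NIM command or API.

```python
# Hypothetical helper: choose the largest supported GPU count from the
# buildable-profile list above (2, 4, 8) that does not exceed what is available.

SUPPORTED_GPU_COUNTS = (2, 4, 8)  # from "Supported TRT-LLM buildable profiles"

def pick_gpu_count(available, supported=SUPPORTED_GPU_COUNTS):
    """Return the largest supported GPU count <= available, or raise."""
    candidates = [n for n in supported if n <= available]
    if not candidates:
        raise ValueError(f"need at least {min(supported)} GPUs, have {available}")
    return max(candidates)

print(pick_gpu_count(8))  # 8
print(pick_gpu_count(6))  # 4 (6 is not a supported count; fall back to 4)
```

Choosing the largest fitting count maximizes aggregate GPU memory for the build; a different policy (for example, the smallest count that satisfies the model's memory requirement) may be preferable when GPUs are shared.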