Support Matrix#

Hardware#

NVIDIA NIMs for large language models (LLMs) run on any NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, with CUDA compute capability > 7.0 (8.0 for bfloat16). Some model/GPU combinations are optimized; see the following Supported Models section for details.
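The compute-capability rule above can be expressed as a small check. The helper below is a hypothetical sketch (the function name and interface are not part of any NIM API); it encodes the stated requirement: capability above 7.0 in general, and at least 8.0 for bfloat16.

```python
# Hypothetical helper -- not part of any NIM API -- encoding the rule above:
# compute capability must be > 7.0 in general, and >= 8.0 for bfloat16.
def meets_capability(major: int, minor: int, use_bfloat16: bool = False) -> bool:
    capability = major + minor / 10
    if use_bfloat16:
        return capability >= 8.0
    return capability > 7.0

# Example: an A100 (capability 8.0) qualifies for bfloat16,
# while a T4 (capability 7.5) qualifies only for other dtypes.
```

On a live system, the (major, minor) pair can be read with `nvidia-smi --query-gpu=compute_cap --format=csv` or, in PyTorch, `torch.cuda.get_device_capability()`.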

Software#

  • Linux operating systems (Ubuntu 20.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1

Supported Models#

These models are optimized with TensorRT-LLM (TRT-LLM), are available as pre-built, optimized engines on NGC, and should use the Chat Completions Endpoint.
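As a sketch of what using the Chat Completions Endpoint means in practice, the snippet below builds an OpenAI-compatible request against a locally deployed NIM. The base URL, port, and model name are assumptions for illustration; substitute the values for your deployment.

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /chat/completions route."""
    payload = {
        "model": model,  # e.g. "meta/llama2-7b-chat" -- an assumed name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To send it against a running NIM:
# with urllib.request.urlopen(build_chat_request("meta/llama2-7b-chat", "Hi")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```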

Llama 2 7B Chat#

Optimized configurations#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model, and Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|-----------------|----|----|-----------------|---|-------|
| H100 SXM/NVLink | 80 | 8  | Latency         | 2 | 6.66  |
| H100 SXM/NVLink | 80 | 16 | Latency         | 2 | 12.93 |
| H100 SXM/NVLink | 80 | 8  | Throughput      | 1 | 6.57  |
| H100 SXM/NVLink | 80 | 16 | Throughput      | 1 | 12.62 |
| H100 SXM/NVLink | 80 | 16 | Throughput LoRA | 1 | 12.63 |
| A100 SXM/NVLink | 80 | 16 | Latency         | 2 | 12.92 |
| A100 SXM/NVLink | 80 | 16 | Throughput      | 1 | 15.54 |
| A100 SXM/NVLink | 80 | 16 | Throughput LoRA | 1 | 12.63 |
| L40S PCIe       | 48 | 8  | Latency         | 2 | 6.64  |
| L40S PCIe       | 48 | 16 | Latency         | 2 | 12.95 |
| L40S PCIe       | 48 | 8  | Throughput      | 1 | 6.57  |
| L40S PCIe       | 48 | 16 | Throughput      | 1 | 12.64 |
| L40S PCIe       | 48 | 16 | Throughput LoRA | 1 | 12.65 |

Non-optimized configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16).

Llama 2 13B Chat#

Optimized configurations#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model, and Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|-----------------|----|----|-----------------|---|-------|
| H100 SXM/NVLink | 80 | 8  | Latency         | 2 | 12.6  |
| H100 SXM/NVLink | 80 | 16 | Latency         | 2 | 24.71 |
| H100 SXM/NVLink | 80 | 16 | Throughput      | 1 | 24.33 |
| H100 SXM/NVLink | 80 | 16 | Throughput LoRA | 1 | 24.35 |
| A100 SXM/NVLink | 80 | 16 | Latency         | 2 | 24.74 |
| A100 SXM/NVLink | 80 | 16 | Throughput      | 2 | 24.34 |
| A100 SXM/NVLink | 80 | 16 | Throughput LoRA | 1 | 24.37 |
| L40S PCIe       | 48 | 8  | Latency         | 2 | 12.39 |
| L40S PCIe       | 48 | 16 | Latency         | 2 | 24.7  |
| L40S PCIe       | 48 | 8  | Throughput      | 1 | 12.49 |
| L40S PCIe       | 48 | 16 | Throughput      | 1 | 24.33 |
| L40S PCIe       | 48 | 16 | Throughput LoRA | 1 | 24.37 |

Non-optimized configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16).

Llama 2 70B Chat#

Optimized configurations#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model, and Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|-----------------|----|----|-----------------|---|--------|
| H100 SXM/NVLink | 80 | 16 | Throughput      | 4 | 130.52 |
| H100 SXM/NVLink | 80 | 8  | Latency         | 4 | 65.36  |
| H100 SXM/NVLink | 80 | 16 | Latency         | 8 | 133.18 |
| H100 SXM/NVLink | 80 | 8  | Throughput      | 2 | 65.08  |
| H100 SXM/NVLink | 80 | 16 | Throughput LoRA | 4 | 130.6  |
| A100 SXM/NVLink | 80 | 16 | Latency         | 4 | 133.12 |
| A100 SXM/NVLink | 80 | 16 | Throughput      | 4 | 130.52 |
| A100 SXM/NVLink | 80 | 16 | Throughput LoRA | 4 | 130.5  |
| L40S PCIe       | 48 | 8  | Throughput      | 4 | 63.35  |

Non-optimized configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16).

Meta-Llama-3-8B-Instruct#

Optimized configurations#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model, and Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|-----------|----|------|------------|---|------|
| H100      | 80 | FP16 | Throughput | 1 | 28   |
| H100      | 80 | FP16 | Latency    | 2 | 28   |
| A100      | 80 | FP16 | Throughput | 1 | 28   |
| A100      | 80 | FP16 | Latency    | 2 | 28   |
| L40S PCIe | 48 | FP8  | Throughput | 1 | 20.5 |
| L40S PCIe | 48 | FP8  | Latency    | 2 | 20.5 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 28   |
| A10G PCIe | 24 | FP16 | Throughput | 1 | 28   |
| A10G PCIe | 24 | FP16 | Latency    | 2 | 28   |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|------|----|------|----|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) | 24 | FP16 | 16 |

Meta-Llama-3-70B-Instruct#

Optimized configurations#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model, and Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|-----------|----|------|------------|---|-----|
| H100      | 80 | FP8  | Throughput | 4 | 82  |
| H100      | 80 | FP8  | Latency    | 8 | 82  |
| H100      | 80 | FP16 | Throughput | 4 | 158 |
| H100      | 80 | FP16 | Latency    | 8 | 158 |
| A100      | 80 | FP16 | Throughput | 4 | 158 |
| L40S PCIe | 48 | FP8  | Throughput | 4 | 82  |
| L40S PCIe | 48 | FP8  | Latency    | 8 | 82  |
| L40S PCIe | 48 | FP16 | Throughput | 8 | 158 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|------|-----|------|-----|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) | 240 | FP16 | 100 |

Mistral-7B-Instruct-v0.3#

Optimized configurations#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model, and Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|-----------------|----|------|------------|---|-------|
| H100 SXM/NVLink | 80 | FP8  | Latency    | 2 | 7.16  |
| H100 SXM/NVLink | 80 | FP16 | Latency    | 2 | 13.82 |
| H100 SXM/NVLink | 80 | FP8  | Throughput | 1 | 7.06  |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 13.54 |
| A100 SXM/NVLink | 80 | FP16 | Latency    | 2 | 13.82 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 13.54 |
| L40S PCIe       | 48 | FP8  | Latency    | 2 | 7.14  |
| L40S PCIe       | 48 | FP16 | Latency    | 2 | 13.82 |
| L40S PCIe       | 48 | FP8  | Throughput | 1 | 7.06  |
| L40S PCIe       | 48 | FP16 | Throughput | 1 | 13.54 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|------|----|------|----|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) | 24 | FP16 | 16 |

Mixtral-8x7B-v0.1#

Optimized configurations#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model, and Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|-----------------|----|------|------------|---|-------|
| H100 SXM/NVLink | 80 | FP8  | Latency    | 4 | 7.16  |
| H100 SXM/NVLink | 80 | FP8  | Throughput | 2 | 7.06  |
| H100 SXM/NVLink | 80 | FP16 | Latency    | 4 | 13.82 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 2 | 13.54 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 4 | 13.82 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 2 | 13.54 |
| L40S PCIe       | 48 | FP16 | Throughput | 4 | 13.82 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|------|----|------|----|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16) | 24 | FP16 | 16 |

Mixtral-8x22B-v0.1#

Optimized configurations#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model, and Profile indicates what the model is optimized for.

| GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
|-----------------|----|--------|------------|---|--------|
| H100 SXM/NVLink | 80 | FP8    | Throughput | 8 | 132.61 |
| H100 SXM/NVLink | 80 | Int4wo | Throughput | 8 | 134.82 |
| H100 SXM/NVLink | 80 | FP16   | Throughput | 8 | 265.59 |
| A100 SXM/NVLink | 80 | FP16   | Throughput | 8 | 265.7  |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB. Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|------|-----|------|-----|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 240 | FP16 | 100 |

Supported LoRA formats#

The following LoRA formats are supported:

| Foundation Model | HuggingFace Format | NeMo Format |
|--------------------------|-----|-----|
| Meta-Llama3-8b-Instruct  | Yes | Yes |
| Meta-Llama3-70b-Instruct | Yes | Yes |