Large Language Models (1.0.0)

Support Matrix

NVIDIA NIMs for large language models run on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, provided the CUDA compute capability is greater than 7.0 (8.0 for bfloat16). Some model/GPU combinations are specifically optimized. See the following Supported Models section for further information.

  • Linux operating systems (Ubuntu 20.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1

These models are optimized using TensorRT-LLM (TRT-LLM) and are available as pre-built, optimized engines on NGC. They should be used through the Chat Completions endpoint.
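NIM LLM containers expose an OpenAI-compatible Chat Completions endpoint. The sketch below builds a request body for such an endpoint; the model name, host, and port in the usage comment are illustrative placeholders, not values taken from this document:

```python
import json


def build_chat_request(model: str, user_message: str,
                       max_tokens: int = 64,
                       temperature: float = 0.2) -> str:
    """Build a JSON body for an OpenAI-compatible /v1/chat/completions call.

    The model identifier and server URL are deployment-specific; the ones
    used below are placeholders for illustration only.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return json.dumps(payload)


# Example: POST this body to http://<nim-host>:<port>/v1/chat/completions
body = build_chat_request("meta/llama3-8b-instruct", "Hello!")
```

The same body works with any HTTP client (e.g. `curl -d "$body"`), since the endpoint follows the OpenAI chat schema.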

Llama 2 7B Chat

Optimized configurations

Disk Space covers both the container and the model; Profile indicates the optimization target.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|-----|-----------------|-----------|---------|-----------|-----------------|
| H100 SXM/NVLink | 80 | 8 | Latency | 2 | 6.66 |
| H100 SXM/NVLink | 80 | 16 | Latency | 2 | 12.93 |
| H100 SXM/NVLink | 80 | 8 | Throughput | 1 | 6.57 |
| H100 SXM/NVLink | 80 | 16 | Throughput | 1 | 12.62 |
| H100 SXM/NVLink | 80 | 16 | Throughput LoRA | 1 | 12.63 |
| A100 SXM/NVLink | 80 | 16 | Latency | 2 | 12.92 |
| A100 SXM/NVLink | 80 | 16 | Throughput | 1 | 15.54 |
| A100 SXM/NVLink | 80 | 16 | Throughput LoRA | 1 | 12.63 |
| L40S PCIe | 48 | 8 | Latency | 2 | 6.64 |
| L40S PCIe | 48 | 16 | Latency | 2 | 12.95 |
| L40S PCIe | 48 | 8 | Throughput | 1 | 6.57 |
| L40S PCIe | 48 | 16 | Throughput | 1 | 12.64 |
| L40S PCIe | 48 | 16 | Throughput LoRA | 1 | 12.65 |

Non-optimized configuration

Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)
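The non-optimized rule above can be expressed as a simple check: aggregate memory across homogeneous GPUs must cover the model's requirement, and compute capability must meet the floor (7.0, or 8.0 when running in bfloat16). A minimal illustrative sketch (function name and the memory figures in the examples are chosen for this example, not taken from the tables):

```python
def can_run_non_optimized(gpu_mem_gb: float, num_gpus: int,
                          required_mem_gb: float,
                          compute_capability: float,
                          bfloat16: bool = False) -> bool:
    """Check the non-optimized configuration rule:
    sufficient aggregate memory on homogeneous GPUs and
    compute capability of at least 7.0 (8.0 for bfloat16)."""
    min_cc = 8.0 if bfloat16 else 7.0
    if compute_capability < min_cc:
        return False
    return gpu_mem_gb * num_gpus >= required_mem_gb


# Two 24 GB GPUs (compute capability 8.6) for a model needing 28 GB: OK.
print(can_run_non_optimized(24, 2, 28, 8.6))                     # True
# bfloat16 on a compute-capability-7.5 GPU fails the 8.0 floor.
print(can_run_non_optimized(24, 2, 28, 7.5, bfloat16=True))      # False
```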

Llama 2 13B Chat

Optimized configurations

Disk Space covers both the container and the model; Profile indicates the optimization target.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|-----|-----------------|-----------|---------|-----------|-----------------|
| H100 SXM/NVLink | 80 | 8 | Latency | 2 | 12.6 |
| H100 SXM/NVLink | 80 | 16 | Latency | 2 | 24.71 |
| H100 SXM/NVLink | 80 | 16 | Throughput | 1 | 24.33 |
| H100 SXM/NVLink | 80 | 16 | Throughput LoRA | 1 | 24.35 |
| A100 SXM/NVLink | 80 | 16 | Latency | 2 | 24.74 |
| A100 SXM/NVLink | 80 | 16 | Throughput | 2 | 24.34 |
| A100 SXM/NVLink | 80 | 16 | Throughput LoRA | 1 | 24.37 |
| L40S PCIe | 48 | 8 | Latency | 2 | 12.39 |
| L40S PCIe | 48 | 16 | Latency | 2 | 24.7 |
| L40S PCIe | 48 | 8 | Throughput | 1 | 12.49 |
| L40S PCIe | 48 | 16 | Throughput | 1 | 24.33 |
| L40S PCIe | 48 | 16 | Throughput LoRA | 1 | 24.37 |

Non-optimized configuration

Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)

Llama 2 70B Chat

Optimized configurations

Disk Space covers both the container and the model; Profile indicates the optimization target.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|-----|-----------------|-----------|---------|-----------|-----------------|
| H100 SXM/NVLink | 80 | 16 | Throughput | 4 | 130.52 |
| H100 SXM/NVLink | 80 | 8 | Latency | 4 | 65.36 |
| H100 SXM/NVLink | 80 | 16 | Latency | 8 | 133.18 |
| H100 SXM/NVLink | 80 | 8 | Throughput | 2 | 65.08 |
| H100 SXM/NVLink | 80 | 16 | Throughput LoRA | 4 | 130.6 |
| A100 SXM/NVLink | 80 | 16 | Latency | 4 | 133.12 |
| A100 SXM/NVLink | 80 | 16 | Throughput | 4 | 130.52 |
| A100 SXM/NVLink | 80 | 16 | Throughput LoRA | 4 | 130.5 |
| L40S PCIe | 48 | 8 | Throughput | 4 | 63.35 |

Non-optimized configuration

Any NVIDIA GPU with sufficient GPU memory or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 (8.0 for bfloat16)

Meta-Llama-3-8B-Instruct

Optimized configurations

Disk Space covers both the container and the model; Profile indicates the optimization target.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|-----|-----------------|-----------|---------|-----------|-----------------|
| H100 | 80 | FP16 | Throughput | 1 | 28 |
| H100 | 80 | FP16 | Latency | 2 | 28 |
| A100 | 80 | FP16 | Throughput | 1 | 28 |
| A100 | 80 | FP16 | Latency | 2 | 28 |
| L40S PCIe | 48 | FP8 | Throughput | 1 | 20.5 |
| L40S PCIe | 48 | FP8 | Latency | 2 | 20.5 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 28 |
| A10G PCIe | 24 | FP16 | Throughput | 1 | 28 |
| A10G PCIe | 24 | FP16 | Latency | 2 | 28 |

Non-optimized configuration

Disk Space covers both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|------|-----------------|-----------|-----------------|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability 7.0 or higher (8.0 for bfloat16) | 24 | FP16 | 16 |

Meta-Llama-3-70B-Instruct

Optimized configurations

Disk Space covers both the container and the model; Profile indicates the optimization target.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|-----|-----------------|-----------|---------|-----------|-----------------|
| H100 | 80 | FP8 | Throughput | 4 | 82 |
| H100 | 80 | FP8 | Latency | 8 | 82 |
| H100 | 80 | FP16 | Throughput | 4 | 158 |
| H100 | 80 | FP16 | Latency | 8 | 158 |
| A100 | 80 | FP16 | Throughput | 4 | 158 |
| L40S PCIe | 48 | FP8 | Throughput | 4 | 82 |
| L40S PCIe | 48 | FP8 | Latency | 8 | 82 |
| L40S PCIe | 48 | FP16 | Throughput | 8 | 158 |

Non-optimized configuration

Disk Space covers both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|------|-----------------|-----------|-----------------|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability 7.0 or higher (8.0 for bfloat16) | 240 | FP16 | 100 |

Mistral-7B-Instruct-v0.3

Optimized configurations

Disk Space covers both the container and the model; Profile indicates the optimization target.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|-----|-----------------|-----------|---------|-----------|-----------------|
| H100 SXM/NVLink | 80 | FP8 | Latency | 2 | 7.16 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 2 | 13.82 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 1 | 7.06 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 13.54 |
| A100 SXM/NVLink | 80 | FP16 | Latency | 2 | 13.82 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 1 | 13.54 |
| L40S PCIe | 48 | FP8 | Latency | 2 | 7.14 |
| L40S PCIe | 48 | FP16 | Latency | 2 | 13.82 |
| L40S PCIe | 48 | FP8 | Throughput | 1 | 7.06 |
| L40S PCIe | 48 | FP16 | Throughput | 1 | 13.54 |

Non-optimized configuration

Disk Space covers both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|------|-----------------|-----------|-----------------|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability 7.0 or higher (8.0 for bfloat16) | 24 | FP16 | 16 |

Mixtral-8x7B-v0.1

Optimized configurations

Disk Space covers both the container and the model; Profile indicates the optimization target.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|-----|-----------------|-----------|---------|-----------|-----------------|
| H100 SXM/NVLink | 80 | FP8 | Latency | 4 | 7.16 |
| H100 SXM/NVLink | 80 | FP8 | Throughput | 2 | 7.06 |
| H100 SXM/NVLink | 80 | FP16 | Latency | 4 | 13.82 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 2 | 13.54 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 4 | 13.82 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 2 | 13.54 |
| L40S PCIe | 48 | FP16 | Throughput | 4 | 13.82 |

Non-optimized configuration

Disk Space covers both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|------|-----------------|-----------|-----------------|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability 7.0 or higher (8.0 for bfloat16) | 24 | FP16 | 16 |

Mixtral-8x22B-v0.1

Optimized configurations

Disk Space covers both the container and the model; Profile indicates the optimization target.

| GPU | GPU Memory (GB) | Precision | Profile | # of GPUs | Disk Space (GB) |
|-----|-----------------|-----------|---------|-----------|-----------------|
| H100 SXM/NVLink | 80 | FP8 | Throughput | 8 | 132.61 |
| H100 SXM/NVLink | 80 | Int4wo | Throughput | 8 | 134.82 |
| H100 SXM/NVLink | 80 | FP16 | Throughput | 8 | 265.59 |
| A100 SXM/NVLink | 80 | FP16 | Throughput | 8 | 265.7 |

Non-optimized configuration

Disk Space covers both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|------|-----------------|-----------|-----------------|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 240 | FP16 | 100 |

The following LoRA formats are supported:

| Foundation Model | HuggingFace Format | NeMo Format |
|------------------|--------------------|-------------|
| Meta-Llama3-8b-Instruct | Yes | Yes |
| Meta-Llama3-70b-Instruct | Yes | Yes |
© Copyright 2024, NVIDIA Corporation. Last updated on Aug 21, 2024.