Large Language Models (Latest)

Support Matrix

NVIDIA NIMs for large language models should, but are not guaranteed to, run on any NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and CUDA compute capability > 7.0 (8.0 for bfloat16). Some model/GPU combinations, including vGPU, are optimized. See the following Supported Models sections for further information.

  • Linux operating systems (Ubuntu 20.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1
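The version minimums above can be checked programmatically. The following is a minimal sketch (the version strings in the example are illustrative, taken from typical `nvidia-smi` and `docker --version` output, not from this document):

```python
def version_tuple(v: str) -> tuple:
    """Convert a dotted version string such as '535.104.05' to a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def meets_minimums(driver: str, docker: str) -> bool:
    """True when the NVIDIA driver is >= 535 and Docker is >= 23.0.1."""
    return version_tuple(driver) >= (535,) and version_tuple(docker) >= (23, 0, 1)

# Hypothetical values as reported by `nvidia-smi` and `docker --version`:
print(meets_minimums("535.104.05", "24.0.7"))  # True on a compliant host
print(meets_minimums("525.85.12", "24.0.7"))   # False: driver too old
```

Tuple comparison handles version components of different lengths, so a driver string like "535.104.05" correctly compares against the bare minimum "535".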

The GPUs listed in the following sections have the following specifications.

GPU    Family       Memory
H100   SXM/NVLink   80 GB
A100   SXM/NVLink   80 GB
L40S   PCIe         48 GB
A10G   PCIe         24 GB

The following models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC; they should use the Chat Completions Endpoint. For vGPU environments, the GPU memory values in the following sections refer to the total GPU memory, including the memory reserved for the vGPU setup.
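The Chat Completions Endpoint mentioned above is OpenAI-compatible. As a hedged sketch using only the Python standard library (the base URL, port, and model name below are assumptions about a local deployment, not taken from this document):

```python
import json
from urllib.request import Request, urlopen

# Assumption: a NIM container serving locally on port 8000 with the
# OpenAI-compatible Chat Completions route.
BASE_URL = "http://localhost:8000/v1/chat/completions"

def build_request(model: str, user_message: str) -> Request:
    """Build an HTTP POST request for the Chat Completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 64,
    }
    return Request(
        BASE_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Hypothetical model name; use the name published with the engine on NGC.
    req = build_request("meta/llama2-7b-chat", "Hello!")
    with urlopen(req) as resp:  # requires a running NIM container
        print(json.load(resp)["choices"][0]["message"]["content"])
```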

Llama 2 7B Chat

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile          # of GPUs  Disk Space
H100   FP8        Throughput       1          6.57
H100   FP8        Latency          2          6.66
H100   FP16       Throughput       1          12.62
H100   FP16       Throughput LoRA  1          12.63
H100   FP16       Latency          2          12.93
A100   FP16       Throughput       1          15.54
A100   FP16       Throughput LoRA  1          12.63
A100   FP16       Latency          2          12.92
L40S   FP8        Throughput       1          6.57
L40S   FP8        Latency          2          6.64
L40S   FP16       Throughput       1          12.64
L40S   FP16       Throughput LoRA  1          12.65
L40S   FP16       Latency          2          12.95

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.
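To illustrate reading the Llama 2 7B table above, here is a small sketch. The profile entries are a subset transcribed from that table; the selection helper itself is hypothetical and not part of NIM:

```python
# Each entry: (gpu, precision, profile, num_gpus, disk_gb), a subset of the
# H100 and L40S rows from the Llama 2 7B table above.
PROFILES = [
    ("H100", "FP8",  "Throughput",      1, 6.57),
    ("H100", "FP8",  "Latency",         2, 6.66),
    ("H100", "FP16", "Throughput",      1, 12.62),
    ("H100", "FP16", "Throughput LoRA", 1, 12.63),
    ("H100", "FP16", "Latency",         2, 12.93),
    ("L40S", "FP8",  "Throughput",      1, 6.57),
]

def candidates(gpu: str, available_gpus: int, free_disk_gb: float):
    """Profiles runnable on this host, smallest disk footprint first."""
    fits = [p for p in PROFILES
            if p[0] == gpu and p[3] <= available_gpus and p[4] <= free_disk_gb]
    return sorted(fits, key=lambda p: p[4])

for p in candidates("H100", 2, 10.0):
    print(p)  # only the two H100 FP8 profiles fit a 10 GB disk budget
```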

Llama 2 13B Chat

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile          # of GPUs  Disk Space
H100   FP8        Latency          2          12.6
H100   FP16       Throughput       1          24.33
H100   FP16       Throughput LoRA  1          24.35
H100   FP16       Latency          2          24.71
A100   FP16       Throughput       1          24.34
A100   FP16       Throughput LoRA  1          24.37
A100   FP16       Latency          2          24.74
L40S   FP8        Throughput       1          12.49
L40S   FP8        Latency          2          12.59
L40S   FP16       Throughput       1          24.33
L40S   FP16       Latency          2          24.7
L40S   FP16       Throughput LoRA  1          24.37

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Llama 2 70B Chat

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile          # of GPUs  Disk Space
H100   FP8        Throughput       2          65.08
H100   FP8        Latency          4          65.36
H100   FP16       Throughput       4          130.52
H100   FP16       Throughput LoRA  4          130.6
H100   FP16       Latency          8          133.18
A100   FP16       Throughput       4          130.52
A100   FP16       Throughput LoRA  4          130.5
A100   FP16       Latency          8          133.12
L40S   FP8        Throughput       4          63.35

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Llama 3 Swallow 70B Instruct V0.1

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile     # of GPUs  Disk Space
H100   FP8        Throughput  2          68.42
H100   FP8        Latency     4          69.3
H100   FP16       Throughput  2          137.7
H100   FP16       Latency     4          145.94
A100   FP16       Throughput  2          137.7
A100   FP16       Latency     2          137.7
L40S   FP8        Throughput  2          68.48
A10G   FP16       Throughput  4          145.93

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Llama 3 Taiwan 70B Instruct

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile     # of GPUs  Disk Space
H100   FP8        Throughput  2          68.42
H100   FP8        Latency     4          145.94
H100   FP16       Throughput  2          137.7
H100   FP16       Latency     4          137.7
A100   FP16       Throughput  2          137.7
A100   FP16       Latency     2          145.94
L40S   FP8        Throughput  2          68.48
A10G   FP16       Throughput  4          145.93

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Llama 3.1 405B Instruct

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile     # of GPUs  Disk Space
H100   FP8        Throughput  8          487
H100   FP16       Latency     16         793
A100   FP16       Latency     16         697

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Mistral 7B Instruct V0.3

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile     # of GPUs  Disk Space
H100   FP8        Throughput  1          7.06
H100   FP8        Latency     2          7.16
H100   FP16       Throughput  1          13.54
H100   FP16       Latency     2          13.82
A100   FP16       Throughput  1          13.54
A100   FP16       Latency     2          13.82
L40S   FP8        Throughput  1          7.06
L40S   FP8        Latency     2          7.14
L40S   FP16       Throughput  1          13.54
L40S   FP16       Latency     2          13.82

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Mistral NeMo 12B Instruct

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile     # of GPUs  Disk Space
H100   FP8        Latency     2          13.82
H100   FP16       Throughput  1          23.35
H100   FP16       Latency     2          25.14
A100   FP16       Throughput  1          23.35
A100   FP16       Latency     2          25.14
L40S   FP8        Throughput  2          13.83
L40S   FP8        Latency     4          15.01
L40S   FP16       Throughput  2          25.14
L40S   FP16       Latency     4          28.71
A10G   FP16       Throughput  4          28.71
A10G   FP16       Latency     8          35.87

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Mixtral 8x22B Instruct V0.1

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile     # of GPUs  Disk Space
H100   FP8        Throughput  8          132.61
H100   FP8        Latency     8          132.56
H100   int8wo     Throughput  8          134.82
H100   int8wo     Latency     8          132.31
H100   FP16       Throughput  8          265.59
A100   FP16       Throughput  8          265.7

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Mixtral 8x7B Instruct V0.1

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile     # of GPUs  Disk Space
H100   FP8        Throughput  2          43.91
H100   FP8        Latency     4          44.07
H100   FP16       Throughput  2          87.35
H100   FP16       Latency     4          87.95
A100   FP16       Throughput  2          87.35
L40S   FP16       Throughput  4          87.95

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Nemotron 4 340B Instruct

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile  # of GPUs  Disk Space
H100   FP16       Latency  16         6.27
A100   FP16       Latency  16         6.27

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Nemotron 4 340B Reward

Optimized configurations

Profile indicates what the model is optimized for; Disk Space includes both the container and the model, in GB.

GPU    Precision  Profile  # of GPUs  Disk Space
H100   FP16       Latency  16         6.37
A100   FP16       Latency  16         6.37

Non-optimized configuration

This model should run, but is not guaranteed to run, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

© Copyright 2024, NVIDIA Corporation. Last updated on Oct 10, 2024.