Large Language Models (1.1.0)

Support Matrix

NVIDIA NIMs for large-language models will run on any NVIDIA GPU, as long as the GPU has sufficient memory, or on multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory and CUDA compute capability > 7.0 (8.0 for bfloat16). Some model/GPU combinations, including vGPU, are optimized. See the following Supported Models section for further information.

  • Linux operating systems (Ubuntu 20.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1

The GPUs listed in the following sections have the following specifications.

GPU   Family      Memory
H100  SXM/NVLink  80GB
A100  SXM/NVLink  80GB
L40S  PCIe        48GB
A10G  PCIe        24GB
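The aggregate memory of a multi-GPU setup follows directly from these specifications. The sketch below is illustrative only; the names `GPU_MEMORY_GB` and `aggregate_memory_gb` are hypothetical helpers, not part of NIM.

```python
# GPU memory per device in GB, copied from the specification table above.
GPU_MEMORY_GB = {"H100": 80, "A100": 80, "L40S": 48, "A10G": 24}


def aggregate_memory_gb(gpu: str, count: int) -> int:
    """Total memory in GB of `count` identical GPUs of the given type."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return GPU_MEMORY_GB[gpu] * count


print(aggregate_memory_gb("A10G", 8))  # 192
print(aggregate_memory_gb("H100", 4))  # 320
```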

In general, NVIDIA recommends the following guidelines for models that NVIDIA NIMs support but that have been neither optimized for our TRT-LLM runtime nor tested against all of the GPUs in our lab. The values below are based on the number of parameters used during training.

Note

These values are estimates, not guarantees.

GPUs

Both H100 and A100 should be 80GB SXM/NVLink models, L40S should be 48GB PCIe models, and A10G should be 24GB PCIe models.

Billion Parameters  H100  A100  L40S  A10G
8 or fewer          1     1     1     1
8 to 70             1     1     2     4
70 to 300           4     4     8     16
300+                8     8     16    32
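The guideline table can be expressed as a small lookup. This is a minimal sketch: `GPU_GUIDELINES` and `estimated_gpu_count` are hypothetical names, and because the table's ranges overlap at 8, 70, and 300, the sketch assumes a model sitting exactly on a boundary falls into the smaller tier.

```python
import math

# Each entry: (upper bound in billions of parameters, GPUs needed per family).
# Assumption: boundary values (8, 70, 300) belong to the smaller tier.
GPU_GUIDELINES = [
    (8, {"H100": 1, "A100": 1, "L40S": 1, "A10G": 1}),
    (70, {"H100": 1, "A100": 1, "L40S": 2, "A10G": 4}),
    (300, {"H100": 4, "A100": 4, "L40S": 8, "A10G": 16}),
    (math.inf, {"H100": 8, "A100": 8, "L40S": 16, "A10G": 32}),
]


def estimated_gpu_count(billion_params: float, gpu: str) -> int:
    """Return the estimated number of GPUs for a model of the given size."""
    if billion_params <= 0:
        raise ValueError("billion_params must be positive")
    for upper_bound, counts in GPU_GUIDELINES:
        if billion_params <= upper_bound:
            return counts[gpu]
    raise AssertionError("unreachable: the last tier has no upper bound")


print(estimated_gpu_count(13, "L40S"))   # 2
print(estimated_gpu_count(405, "A10G"))  # 32
```

As with the table itself, the result is an estimate, not a guarantee.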

Disk Space

In general, expect the vLLM runtime and a model together to take up about 4 GB of disk space per billion parameters. For example, a 400B model with the vLLM runtime should occupy about 1.6 TB of disk space.
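The rule of thumb above is simple arithmetic. The sketch below applies it and compares the result against free space on a volume; the function names are hypothetical, and treating 1 GB as 10^9 bytes is an assumption.

```python
import shutil


def estimated_disk_gb(billion_params: float, gb_per_billion: float = 4.0) -> float:
    """Estimated disk space in GB for the vLLM runtime plus a model."""
    return billion_params * gb_per_billion


def has_enough_disk(path: str, billion_params: float) -> bool:
    """Check whether the volume holding `path` has the estimated free space."""
    # Assumption: 1 GB is counted as 10**9 bytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= estimated_disk_gb(billion_params)


print(estimated_disk_gb(400))  # 1600.0 GB, about 1.6 TB
```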

The following models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC. Use the Chat Completions Endpoint with these models. For vGPU environments, the GPU memory values in the following sections refer to the total GPU memory, including the GPU memory reserved for the vGPU setup.

Llama 3 Swallow 70B Instruct V0.1

Optimized configurations

Profile indicates what the model is optimized for; LoRA indicates whether the model supports LoRA.

GPU   Precision  Profile     # of GPUs  LoRA
A100  fp16       Latency     8
A100  fp16       Throughput  4
A100  fp16       Throughput  4          Y
H100  fp8        Latency     8
H100  fp16       Latency     8
H100  fp8        Throughput  4
H100  fp16       Throughput  4
H100  fp16       Throughput  4          Y
L40S  fp8        Latency     8
L40S  fp16       Throughput  8
L40S  fp8        Throughput  4
L40S  fp16       Throughput  8          Y
A10G  fp16       Throughput  8
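To pick a configuration from a table like the one above, it can help to treat the rows as data and filter them. The sketch below is illustrative: `Profile`, `SWALLOW_70B_PROFILES`, and `select_profiles` are hypothetical names, and a blank LoRA cell is assumed to mean the profile does not support LoRA.

```python
from typing import List, NamedTuple, Optional


class Profile(NamedTuple):
    gpu: str
    precision: str
    target: str      # "Latency" or "Throughput"
    num_gpus: int
    lora: bool


# Rows copied from the Llama 3 Swallow 70B Instruct table above.
# Assumption: a blank LoRA cell means no LoRA support.
SWALLOW_70B_PROFILES = [
    Profile("A100", "fp16", "Latency", 8, False),
    Profile("A100", "fp16", "Throughput", 4, False),
    Profile("A100", "fp16", "Throughput", 4, True),
    Profile("H100", "fp8", "Latency", 8, False),
    Profile("H100", "fp16", "Latency", 8, False),
    Profile("H100", "fp8", "Throughput", 4, False),
    Profile("H100", "fp16", "Throughput", 4, False),
    Profile("H100", "fp16", "Throughput", 4, True),
    Profile("L40S", "fp8", "Latency", 8, False),
    Profile("L40S", "fp16", "Throughput", 8, False),
    Profile("L40S", "fp8", "Throughput", 4, False),
    Profile("L40S", "fp16", "Throughput", 8, True),
    Profile("A10G", "fp16", "Throughput", 8, False),
]


def select_profiles(
    profiles: List[Profile],
    gpu: str,
    target: Optional[str] = None,
    lora: Optional[bool] = None,
) -> List[Profile]:
    """Return the rows matching the GPU and, if given, the target and LoRA flag."""
    return [
        p for p in profiles
        if p.gpu == gpu
        and (target is None or p.target == target)
        and (lora is None or p.lora == lora)
    ]


for p in select_profiles(SWALLOW_70B_PROFILES, "H100", target="Throughput"):
    print(p.precision, p.num_gpus, "LoRA" if p.lora else "")
```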

Llama 3 Taiwan 70B Instruct

Optimized configurations

Profile indicates what the model is optimized for; LoRA indicates whether the model supports LoRA.

GPU   Precision  Profile     # of GPUs  LoRA
A100  fp16       Latency     8
A100  fp16       Throughput  4
A100  fp16       Throughput  4          Y
H100  fp8        Latency     8
H100  fp16       Latency     8
H100  fp8        Throughput  4
H100  fp16       Throughput  4
H100  fp16       Throughput  4          Y
L40S  fp8        Latency     8
L40S  fp16       Throughput  8
L40S  fp8        Throughput  4
L40S  fp16       Throughput  8          Y
A10G  fp16       Throughput  8

Llama 3.1 8B Base

Optimized configurations

NVIDIA recommends at least 50GB disk space for the container and model.

Profile indicates what the model is optimized for.

GPU   Precision  Profile     # of GPUs
H100  BF16       Latency     2
H100  FP8        Latency     2
H100  BF16       Throughput  1
H100  FP8        Throughput  1
H100  BF16       Throughput  1
A100  BF16       Latency     2
A100  BF16       Throughput  1
A100  BF16       Throughput  1
L40S  BF16       Latency     2
L40S  BF16       Throughput  2
L40S  BF16       Throughput  2
A10G  BF16       Latency     4
A10G  BF16       Throughput  2
A10G  BF16       Throughput  4

Non-optimized configuration

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs: Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory
GPU Memory: 24
Precision: FP16
Disk Space: 15
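The non-optimized requirements combine several conditions: homogeneous GPUs, enough aggregate memory, and at least one GPU that is at least 95% free. The sketch below checks those three conditions under stated assumptions: `Gpu` and `fits_non_optimized` are hypothetical names, "aggregate memory" is taken to mean total (not free) memory, and the compute capability condition is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gpu:
    name: str
    total_gb: float
    free_gb: float


def fits_non_optimized(
    gpus: List[Gpu],
    required_gb: float,
    min_free_fraction: float = 0.95,
) -> bool:
    """Check homogeneity, aggregate memory, and the 95%-free condition."""
    if not gpus:
        return False
    # Homogeneous: every GPU has the same model name and memory size.
    homogeneous = len({(g.name, g.total_gb) for g in gpus}) == 1
    # Assumption: aggregate memory means the sum of total memory per GPU.
    aggregate_ok = sum(g.total_gb for g in gpus) >= required_gb
    # At least one GPU must have 95% or more of its memory free.
    one_mostly_free = any(g.free_gb / g.total_gb >= min_free_fraction for g in gpus)
    return homogeneous and aggregate_ok and one_mostly_free


# A single idle A10G against the 24 GB requirement listed above.
print(fits_non_optimized([Gpu("A10G", 24, 23.5)], required_gb=24))  # True
```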

Llama 3.1 8B Instruct

Optimized configurations

NVIDIA recommends at least 50GB disk space for the container and model.

Profile indicates what the model is optimized for.

GPU   Precision  Profile     # of GPUs
H100  BF16       Latency     2
H100  FP8        Latency     2
H100  BF16       Throughput  1
H100  FP8        Throughput  1
H100  BF16       Throughput  1
A100  BF16       Latency     2
A100  BF16       Throughput  1
A100  BF16       Throughput  1
L40S  BF16       Latency     2
L40S  BF16       Throughput  2
L40S  BF16       Throughput  1
L40S  BF16       Throughput  2
A10G  BF16       Latency     4
A10G  BF16       Throughput  2
A10G  BF16       Throughput  4

Non-optimized configuration

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs: Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory
GPU Memory: 24
Precision: FP16
Disk Space: 15

Llama 3.1 70B Instruct

NVIDIA recommends at least 350GB disk space for the container and model.

Optimized configurations

Profile indicates what the model is optimized for.

GPU   Precision  Profile     # of GPUs
H100  BF16       Latency     8
H100  FP8        Latency     8
H100  BF16       Throughput  4
H100  FP8        Throughput  4
H100  BF16       Throughput  4
A100  BF16       Latency     8
A100  BF16       Throughput  4
A100  BF16       Throughput  4
L40S  BF16       Throughput  8
L40S  BF16       Throughput  8

Llama 3.1 405B Instruct

NVIDIA recommends at least 1.5TB disk space for the container and model.

Note

Only optimized profiles are available for Llama 3.1 405B Instruct.

GPU   Precision  Profile     # of GPUs
H100  FP16       Latency     16
H100  FP8        Latency     16
H100  FP8        Throughput  8
A100  FP16       Latency     16

Meta-Llama-3-8B-Instruct

Optimized configurations

The Disk Space values are in GB and cover both the container and the model; Profile indicates what the model is optimized for.

GPU   Precision  Profile     # of GPUs  Disk Space
H100  FP16       Throughput  1          28
H100  FP16       Latency     2          28
A100  FP16       Throughput  1          28
A100  FP16       Latency     2          28
L40S  FP8        Throughput  1          20.5
L40S  FP8        Latency     2          20.5
L40S  FP16       Throughput  1          28
A10G  FP16       Throughput  1          28
A10G  FP16       Latency     2          28

Non-optimized configuration

Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Meta-Llama-3-70B-Instruct

Optimized configurations

The Disk Space values are in GB and cover both the container and the model; Profile indicates what the model is optimized for.

GPU   Precision  Profile     # of GPUs  Disk Space
H100  FP8        Throughput  4          82
H100  FP8        Latency     8          82
H100  FP16       Throughput  4          158
H100  FP16       Latency     8          158
A100  FP16       Throughput  4          158

Non-optimized configuration

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs: Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory
GPU Memory: 240
Precision: FP16
Disk Space: 100

Mistral-7B-Instruct-v0.3

Optimized configurations

The Disk Space values are in GB and cover both the container and the model; Profile indicates what the model is optimized for.

GPU   Precision  Profile     # of GPUs  Disk Space
H100  FP8        Latency     2          7.16
H100  FP16       Latency     2          13.82
H100  FP8        Throughput  1          7.06
H100  FP16       Throughput  1          13.54
A100  FP16       Latency     2          13.82
A100  FP16       Throughput  1          13.54
L40S  FP8        Latency     2          7.14
L40S  FP16       Latency     2          13.82
L40S  FP8        Throughput  1          7.06
L40S  FP16       Throughput  1          13.54

Non-optimized configuration

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs: Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory
GPU Memory: 24
Precision: FP16
Disk Space: 16

Mixtral-8x7B-v0.1

Optimized configurations

The Disk Space values are in GB and cover both the container and the model; Profile indicates what the model is optimized for.

GPU   Precision  Profile     # of GPUs  Disk Space
H100  FP8        Latency     4          7.16
H100  FP8        Throughput  2          7.06
H100  FP16       Latency     4          13.82
H100  FP16       Throughput  2          13.54
A100  FP16       Throughput  4          13.82
A100  FP16       Throughput  2          13.54
L40S  FP16       Throughput  4          13.82

Non-optimized configuration

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs: Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory
GPU Memory: 24
Precision: FP16
Disk Space: 16

Mistral-NeMo-12B-Instruct

Optimized configurations

The Disk Space values are in GB and cover both the container and the model; Profile indicates what the model is optimized for.

GPU   Precision  Profile     # of GPUs  Disk Space
H100  FP16       Throughput  1          23.35
H100  FP16       Latency     2          25.14
H100  FP8        Latency     2          13.82
A100  FP16       Throughput  1          23.35
A100  FP16       Latency     2          25.14
L40S  FP16       Throughput  2          25.14
L40S  FP16       Latency     4          28.71
L40S  FP8        Throughput  2          13.83
L40S  FP8        Latency     4          15.01
A10G  FP16       Throughput  4          28.71
A10G  FP16       Latency     4          35.87

Non-optimized configuration

Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

Mixtral-8x22B-v0.1

Optimized configurations

The Disk Space values are in GB and cover both the container and the model; Profile indicates what the model is optimized for.

GPU   Precision  Profile     # of GPUs  Disk Space
H100  FP8        Throughput  8          132.61
H100  Int4wo     Throughput  8          134.82
H100  FP16       Throughput  8          265.59
A100  FP16       Throughput  8          265.7

Nemotron 4 340B Instruct

Optimized configurations

Profile indicates what the model is optimized for.

GPU   Precision  Profile  # of GPUs
H100  FP16       Latency  16
A100  FP16       Latency  16

Non-optimized configuration

Any NVIDIA GPU with sufficient GPU memory or multiple, homogeneous NVIDIA GPUs with sufficient aggregate memory, compute capability > 7.0 (8.0 for bfloat16), and at least one GPU with 95% or greater free memory.

© Copyright 2024, NVIDIA Corporation. Last updated on Sep 9, 2024.