Support Matrix - NVIDIA Docs

Hardware

NVIDIA NIMs for large-language models will run on any NVIDIA GPU, as long as the GPU has sufficient memory, or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory and CUDA compute capability > 7.0 or higher. Some model/GPU combinations are optimized. See the following Supported Models section for further information.

Software

Linux operating systems (Ubuntu 20.04 or later recommended)
NVIDIA Driver >= 535
NVIDIA Docker >= 23.0.1

Supported Models

These models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC and should use the Chat Completions Endpoint.

Llama 3 8B Instruct

Optimized configurations

The GPU Memory and Disc Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	GPU Memory	Precision	Profile	# of GPUs	Disk Space
H100	80	FP8	Throughput	1	8.5
H100	80	FP8	Latency	2	8.5
H100	80	FP16	Throughput	1	16
H100	80	FP16	Latency	2	16
A100	80	FP16	Throughput	1	16
A100	80	FP16	Latency	2	16
L40S PCIe	48	FP8	Throughput	1	8.5
L40S PCIe	48	FP8	Latency	2	8.5
L40S PCIe	48	FP16	Throughput	1	16
A10G PCIe	24	FP16	Throughput	1	16
A10G PCIe	24	FP16	Latency	2	16

Non-optimized configuration

The GPU Memory and Disc Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory and compute capability > 7.0 or higher	24	FP16	16

Llama 3 70B Instruct

Optimized configurations

The GPU Memory and Disc Space values are in GB; Disk Space is for both the container and the model; Profile is for what the model is optimized.

GPU	GPU Memory	Precision	Profile	# of GPUs	Disk Space
H100	80	FP8	Throughput	4	85
H100	80	FP8	Latency	8	85
H100	80	FP16	Throughput	4	155
H100	80	FP16	Latency	8	155
A100	80	FP16	Throughput	4	155
L40S PCIe	48	FP8	Throughput	4	85
L40S PCIe	48	FP8	Latency	8	85
L40S PCIe	48	FP16	Throughput	8	155

Non-optimized configuration

The GPU Memory and Disc Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory	240	FP16	100

Supported LoRA formats

The following LoRA formats are supported:

Foundation Model	HuggingFace Format	NeMo Format
Meta-Llama3-8b-Instruct	fused/grouped - Query Attention	fused/grouped - Query Attention
Meta-Llama3-70b-Instruct	fused/grouped - Query Attention	fused/grouped - Query Attention