NVIDIA NIMs for large language models (LLMs) run on any NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, provided the GPUs have CUDA compute capability 7.0 or higher. Some model/GPU combinations are optimized. See the following Supported Models section for further information.
Linux operating system (Ubuntu 20.04 or later recommended)
NVIDIA Driver >= 535
NVIDIA Docker >= 23.0.1
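The multi-GPU requirement above (homogeneous GPUs, sufficient aggregate memory, compute capability 7.0 or higher) can be expressed as a quick pre-flight check. The function below is an illustrative sketch, not part of NIM or its tooling:

```python
def meets_requirements(gpus, required_mem_gb):
    """Check whether a set of GPUs can host a NIM LLM.

    gpus: list of (name, memory_gb, compute_capability) tuples.
    Requires homogeneous GPUs, compute capability >= 7.0,
    and sufficient aggregate memory.
    """
    if not gpus:
        return False
    if len({name for name, _, _ in gpus}) > 1:
        return False  # GPUs must be homogeneous
    if any(cc < 7.0 for _, _, cc in gpus):
        return False  # compute capability 7.0 or higher
    return sum(mem for _, mem, _ in gpus) >= required_mem_gb

# Two A100 80 GB cards (compute capability 8.0) for a 140 GB model:
print(meets_requirements([("A100", 80, 8.0)] * 2, 140))  # True
```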
These models are optimized with TensorRT-LLM (TRT-LLM), are available as pre-built, optimized engines on NGC, and should use the Chat Completions endpoint.
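NIM exposes an OpenAI-compatible Chat Completions route. The request below is a minimal sketch; the local port (8000), route, and model name are assumptions that may differ in your deployment:

```python
import json
import urllib.request

# Assumed defaults: NIM serving locally on port 8000 with the
# OpenAI-compatible route; adjust the model name to your deployment.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(request)  # uncomment against a running NIM
```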
Llama 3 8B Instruct
Optimized configurations
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model; Profile indicates the objective the engine is optimized for.
GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
---|---|---|---|---|---|
H100 | 80 | FP8 | Throughput | 1 | 20.5 |
H100 | 80 | FP8 | Latency | 2 | 20.5 |
H100 | 80 | FP16 | Throughput | 1 | 28 |
H100 | 80 | FP16 | Latency | 2 | 28 |
A100 | 80 | FP16 | Throughput | 1 | 28 |
A100 | 80 | FP16 | Latency | 2 | 28 |
L40S PCIe | 48 | FP8 | Throughput | 1 | 20.5 |
L40S PCIe | 48 | FP8 | Latency | 2 | 20.5 |
L40S PCIe | 48 | FP16 | Throughput | 1 | 28 |
A10G PCIe | 24 | FP16 | Throughput | 1 | 28 |
A10G PCIe | 24 | FP16 | Latency | 2 | 28 |
Non-optimized configuration
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
GPUs | GPU Memory | Precision | Disk Space |
---|---|---|---|
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory and compute capability 7.0 or higher | 24 | FP16 | 16 |
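As a rough sanity check on the 24 GB figure: FP16 stores two bytes per parameter, so the weights of an 8B-parameter model alone occupy about 15 GB, and the remainder is headroom for the KV cache and activations. A back-of-the-envelope calculation (illustrative only):

```python
def fp16_weights_gb(n_params):
    # FP16 = 2 bytes per parameter; 1 GB = 1024**3 bytes
    return n_params * 2 / 1024**3

print(round(fp16_weights_gb(8e9), 1))  # ~14.9 GB of weights for an 8B model
```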
Llama 3 70B Instruct
Optimized configurations
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model; Profile indicates the objective the engine is optimized for.
GPU | GPU Memory | Precision | Profile | # of GPUs | Disk Space |
---|---|---|---|---|---|
H100 | 80 | FP8 | Throughput | 4 | 82 |
H100 | 80 | FP8 | Latency | 8 | 82 |
H100 | 80 | FP16 | Throughput | 4 | 158 |
H100 | 80 | FP16 | Latency | 8 | 158 |
A100 | 80 | FP16 | Throughput | 4 | 158 |
L40S PCIe | 48 | FP8 | Throughput | 4 | 82 |
L40S PCIe | 48 | FP8 | Latency | 8 | 82 |
L40S PCIe | 48 | FP16 | Throughput | 8 | 158 |
Non-optimized configuration
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
GPUs | GPU Memory | Precision | Disk Space |
---|---|---|---|
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 240 | FP16 | 100 |
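Given the 240 GB aggregate figure, the minimum homogeneous GPU count is a ceiling division; actual tensor-parallel deployments may round this up further (the optimized 70B profiles above use 4 or 8 GPUs). A sketch of the aggregate-memory floor only:

```python
import math

def min_gpus(required_mem_gb, per_gpu_mem_gb):
    # Smallest homogeneous GPU count whose aggregate memory covers the model
    return math.ceil(required_mem_gb / per_gpu_mem_gb)

print(min_gpus(240, 80))  # 3 x 80 GB GPUs meet the aggregate-memory floor
print(min_gpus(240, 48))  # 5 x 48 GB GPUs
```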
The following LoRA formats are supported:
Foundation Model | HuggingFace Format | NeMo Format |
---|---|---|
Meta-Llama3-8b-Instruct | Yes | Yes |
Meta-Llama3-70b-Instruct | Yes | Yes |