NVIDIA NIM for LLMs uses the TensorRT-LLM backend for performance-optimized inference. With this backend, the Operator supports the following models and hardware.
Model | GPU Requirements
---|---
Llama-2-13b-chat (default) | 2 × A100 80 GB SXM or 2 × H100 80 GB SXM
Mixtral-8x7B-v0.1 | 4 × A100 80 GB SXM or 4 × H100 80 GB SXM
When NVIDIA NIM for LLMs is deployed to use the vLLM backend, the Operator supports the following models and hardware. The vLLM backend supports both SXM and PCIe.
Model | GPU Requirements
---|---
Llama-2-7b-chat | 1 × L40S 48 GB, 1 × A100 80 GB, or 1 × H100 80 GB
Llama-2-13b-chat (default) | 1 × L40S 48 GB, 1 × A100 80 GB, or 1 × H100 80 GB
Llama-2-70b-chat | 8 × L40S 48 GB, 4 × A100 80 GB, or 4 × H100 80 GB
Mistral-7B-Instruct-v0.2 | 1 × L40S 48 GB, 1 × A100 80 GB, or 1 × H100 80 GB
Mixtral-8x7B-Instruct-v0.1 | 4 × L40S 48 GB, 2 × A100 80 GB, or 2 × H100 80 GB
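As an illustration only (this helper is not part of NVIDIA NIM or the Operator), the vLLM support matrix above can be encoded as data to sanity-check a planned deployment; the model names and GPU counts are taken verbatim from the table:

```python
# Hypothetical helper: encodes the vLLM backend support matrix from the
# table above and checks whether an available GPU inventory satisfies a
# model's requirement.

VLLM_MATRIX = {
    # model name: {GPU product: required GPU count}
    "Llama-2-7b-chat":            {"L40S 48 GB": 1, "A100 80 GB": 1, "H100 80 GB": 1},
    "Llama-2-13b-chat":           {"L40S 48 GB": 1, "A100 80 GB": 1, "H100 80 GB": 1},
    "Llama-2-70b-chat":           {"L40S 48 GB": 8, "A100 80 GB": 4, "H100 80 GB": 4},
    "Mistral-7B-Instruct-v0.2":   {"L40S 48 GB": 1, "A100 80 GB": 1, "H100 80 GB": 1},
    "Mixtral-8x7B-Instruct-v0.1": {"L40S 48 GB": 4, "A100 80 GB": 2, "H100 80 GB": 2},
}

def satisfies(model: str, gpu_product: str, available: int) -> bool:
    """Return True if `available` GPUs of `gpu_product` meet the model's need."""
    required = VLLM_MATRIX.get(model, {}).get(gpu_product)
    return required is not None and available >= required

print(satisfies("Llama-2-70b-chat", "A100 80 GB", 4))  # True
print(satisfies("Llama-2-70b-chat", "L40S 48 GB", 4))  # False: the table requires 8
```

A check like this can catch undersized node pools before a deployment is attempted; the authoritative source remains the table itself.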
The Operator supports the following GPUs for embedding. The following section covers the constraints for inference, which depend on the inference model and the GPU model.
- NVIDIA H100
- NVIDIA A100 80 GB
- NVIDIA L40S
- NVIDIA vGPU 17
Operating System | Kubernetes | VMware vSphere with Tanzu
---|---|---
Ubuntu 22.04 | 1.26 to 1.28 | 8.0 Update 2
Operating System | containerd
---|---
Ubuntu 22.04 | 1.6, 1.7
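The platform matrices above can likewise be sketched as a small version check. The parsing helper below is illustrative only (not part of the Operator); the supported ranges are taken from the tables for Ubuntu 22.04:

```python
# Illustrative only: checks reported Kubernetes and containerd versions
# against the supported ranges in the tables above (Ubuntu 22.04).

def major_minor(version: str) -> tuple[int, int]:
    """Parse 'major.minor[...]' (optionally prefixed with 'v') into (major, minor)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)

def kubernetes_supported(version: str) -> bool:
    # Supported range from the table: 1.26 through 1.28.
    return (1, 26) <= major_minor(version) <= (1, 28)

def containerd_supported(version: str) -> bool:
    # Supported series from the table: 1.6 and 1.7.
    return major_minor(version) in {(1, 6), (1, 7)}

print(kubernetes_supported("v1.27.4"))   # True
print(containerd_supported("1.7.2"))     # True
print(kubernetes_supported("v1.29.0"))   # False: above the supported range
```

Comparing only the major.minor pair matches how the tables express support (by release series rather than exact patch versions).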