# Hardware Support for NVIDIA NIM on Google Kubernetes Engine (GKE)
The following table lists the optimized profiles supported for NVIDIA NIM on Google Kubernetes Engine (GKE), along with the hardware configuration each profile requires.
| NIM | Version | Min #GPUs required for NIM | GPU | Compute name in config page | #GPUs on instance | Precision | Profile |
|---|---|---|---|---|---|---|---|
| meta/llama3.1-405b-instruct | 1.1.2 | 8 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Throughput |
| meta/llama3.1-8b-instruct | 1.1.2 | 1 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Throughput |
| | | 2 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Latency |
| | | 1 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-1g | 1 | BF16 (trt-llm) | Throughput |
| | | 2 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-2g | 2 | BF16 (trt-llm) | Latency |
| | | 2 | L4 | L4-&lt;region&gt;-g2-standard-24 | 2 | FP16 (vllm) | Non-optimized |
| meta/llama3.1-70b-instruct | 1.1.2 | 4 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Throughput |
| | | 8 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Latency |
| | | 4 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-4g | 4 | BF16 (trt-llm) | Throughput |
| | | 8 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-8g | 8 | BF16 (trt-llm) | Latency |
| | | 8 | L4 | L4-&lt;region&gt;-g2-standard-96 | 8 | FP16 (vllm) | Non-optimized |
| meta/llama3-70b-instruct | 1.0.3 | 4 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Throughput |
| | | 8 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Latency |
| | | 4 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-4g | 4 | FP16 | Throughput |
| | | 8 | L4 | L4-&lt;region&gt;-g2-standard-96 | 8 | FP16 (vllm) | Non-optimized |
| meta/llama3-8b-instruct | 1.0.3 | 1 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP16 | Throughput |
| | | 2 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP16 | Latency |
| | | 1 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-1g | 1 | FP16 | Throughput |
| | | 2 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-2g | 2 | FP16 | Latency |
| | | 2 | L4 | L4-&lt;region&gt;-g2-standard-24 | 2 | FP16 (vllm) | Non-optimized |
| mistralai/mistral-7b-instruct-v0.3 | 1.0.3 | 1 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Throughput |
| | | 2 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Latency |
| | | 1 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-1g | 1 | FP16 (trt-llm) | Throughput |
| | | 2 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-2g | 2 | FP16 (trt-llm) | Latency |
| | | 4 | L4 | L4-&lt;region&gt;-g2-standard-48 | 4 | FP16 (vllm) | Non-optimized |
| mistralai/mixtral-8x7b-instruct-v0.1 | 1.0.0 | 2 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Throughput |
| | | 4 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 (trt-llm) | Latency |
| | | 2 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-2g | 2 | FP16 (trt-llm) | Throughput |
| | | 4 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-4g | 4 | FP16 (trt-llm) | Latency |
| nvidia/nv-rerankqa-mistral-4b-v3 | 1.0.2 | 1 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP16 | |
| | | 1 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-1g | 1 | FP16 | |
| | | 2 | L4 | L4-&lt;region&gt;-g2-standard-24 | 2 | FP16 | |
| nvidia/nv-embedqa-e5-v5 | 1.0.1 | 1 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP16 | |
| | | 1 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-1g | 1 | FP16 | |
| | | 2 | L4 | L4-&lt;region&gt;-g2-standard-24 | 2 | FP16 | |
| nvidia/nv-embedqa-mistral-7b-v2 | 1.0.1 | 1 | H100 (80GB) | H100(80GB)-&lt;region&gt;-a3-highgpu-8g | 8 | FP8 | |
| | | 1 | A100 (80GB) | A100(80GB)-&lt;region&gt;-a2-ultragpu-1g | 1 | FP16 | |
| | | 2 | L4 | L4-&lt;region&gt;-g2-standard-24 | 2 | FP16 | |
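Each "Compute name in config page" entry encodes a GCP machine type (for example, `a3-highgpu-8g` for H100 or `g2-standard-24` for L4). As a minimal sketch of how one of these rows maps to GKE provisioning, the following creates a node pool on the `a3-highgpu-8g` machine type; the cluster name, node pool name, and region are placeholders, not values from this document, and your project quota and region availability will determine what actually works.

```shell
# Hypothetical example: node pool matching the
# H100(80GB)-<region>-a3-highgpu-8g rows in the table above.
# Replace my-gke-cluster, nim-h100-pool, and us-central1 with your own values.
gcloud container node-pools create nim-h100-pool \
  --cluster=my-gke-cluster \
  --region=us-central1 \
  --machine-type=a3-highgpu-8g \
  --accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=latest \
  --num-nodes=1
```

The `count=8` matches the "#GPUs on instance" column for `a3-highgpu-8g`; profiles that need fewer GPUs (for example, the 4-GPU Throughput profile for meta/llama3.1-70b-instruct) still schedule onto this instance and request the GPU count via the pod's resource limits.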