# Hardware Support for NVIDIA NIM on Google Kubernetes Engine (GKE)

The following table lists the optimized profiles supported for NVIDIA NIM on Google Kubernetes Engine (GKE), for each supported hardware configuration.

| NIM | Version | Min # GPUs required for NIM | GPU | Compute name in config page | # GPUs on instance | Precision | Profile |
|-----|---------|-----------------------------|-----|-----------------------------|--------------------|-----------|---------|
| meta/llama3.1-405b-instruct | 1.1.2 | 8 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Throughput |
| meta/llama3.1-8b-instruct | 1.1.2 | 1 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Throughput |
| | | 2 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Latency |
| | | 1 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-1g` | 1 | BF16 (trt-llm) | Throughput |
| | | 2 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-2g` | 2 | BF16 (trt-llm) | Latency |
| | | 2 | L4 | `L4-<region>-g2-standard-24` | 2 | FP16 (vllm) | Non-optimized |
| meta/llama3.1-70b-instruct | 1.1.2 | 4 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Throughput |
| | | 8 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Latency |
| | | 4 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-4g` | 4 | BF16 (trt-llm) | Throughput |
| | | 8 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-8g` | 8 | BF16 (trt-llm) | Latency |
| | | 8 | L4 | `L4-<region>-g2-standard-96` | 8 | FP16 (vllm) | Non-optimized |
| meta/llama3-70b-instruct | 1.0.3 | 4 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Throughput |
| | | 8 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Latency |
| | | 4 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-4g` | 4 | FP16 | Throughput |
| | | 8 | L4 | `L4-<region>-g2-standard-96` | 8 | FP16 (vllm) | Non-optimized |
| meta/llama3-8b-instruct | 1.0.3 | 1 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP16 | Throughput |
| | | 2 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP16 | Latency |
| | | 1 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-1g` | 1 | FP16 | Throughput |
| | | 2 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-2g` | 2 | FP16 | Latency |
| | | 2 | L4 | `L4-<region>-g2-standard-24` | 2 | FP16 (vllm) | Non-optimized |
| mistralai/mistral-7b-instruct-v0.3 | 1.0.3 | 1 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Throughput |
| | | 2 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Latency |
| | | 1 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-1g` | 1 | FP16 (trt-llm) | Throughput |
| | | 2 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-2g` | 2 | FP16 (trt-llm) | Latency |
| | | 4 | L4 | `L4-<region>-g2-standard-48` | 4 | FP16 (vllm) | Non-optimized |
| mistralai/mixtral-8x7b-instruct-v0.1 | 1.0.0 | 2 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Throughput |
| | | 4 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 (trt-llm) | Latency |
| | | 2 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-2g` | 2 | FP16 (trt-llm) | Throughput |
| | | 4 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-4g` | 4 | FP16 (trt-llm) | Latency |
| nvidia/nv-rerankqa-mistral-4b-v3 | 1.0.2 | 1 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP16 | |
| | | 1 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-1g` | 1 | FP16 | |
| | | 2 | L4 | `L4-<region>-g2-standard-24` | 2 | FP16 | |
| nvidia/nv-embedqa-e5-v5 | 1.0.1 | 1 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP16 | |
| | | 1 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-1g` | 1 | FP16 | |
| | | 2 | L4 | `L4-<region>-g2-standard-24` | 2 | FP16 | |
| nvidia/nv-embedqa-mistral-7b-v2 | 1.0.1 | 1 | H100 (80GB) | `H100(80GB)-<region>-a3-highgpu-8g` | 8 | FP8 | |
| | | 1 | A100 (80GB) | `A100(80GB)-<region>-a2-ultragpu-1g` | 1 | FP16 | |
| | | 2 | L4 | `L4-<region>-g2-standard-24` | 2 | FP16 | |
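
To run one of the configurations above, your GKE cluster needs a node pool built on the matching GCP machine type. The following is a minimal sketch of provisioning a node pool for the `a3-highgpu-8g` rows; the cluster name, node pool name, and region are placeholders, and available zones, driver versions, and GPU quota vary by project, so check the GKE documentation for your environment.

```shell
# Sketch only: create a GKE node pool matching the a3-highgpu-8g (8x H100 80GB)
# rows in the table above. Cluster name, pool name, and region are placeholders.
gcloud container node-pools create nim-h100-pool \
    --cluster=my-gke-cluster \
    --region=us-central1 \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8 \
    --num-nodes=1
```

For the A100 and L4 rows, the same pattern applies with the corresponding machine type (for example `a2-ultragpu-1g` or `g2-standard-24`) and accelerator type (`nvidia-a100-80gb` or `nvidia-l4`), with `count` set to the "# GPUs on instance" column.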