Google Cloud#
NIM supports deployment on Google Cloud Platform (GCP) through two approaches:
Google Kubernetes Engine (GKE): Self-managed Kubernetes deployment using Helm charts with complete control over infrastructure, GPU node pools, and persistent storage.
Vertex AI: Fully managed ML platform with built-in load balancing, auto-scaling, and automatic authentication.
Both deployment strategies can use Google Cloud Storage (GCS) as the backend for model storage. The following table compares GKE and Vertex AI across common deployment considerations:
| Aspect | GKE (Helm) | Vertex AI |
|---|---|---|
| Infrastructure Management | User manages cluster, nodes, and GPUs | Fully managed by Google |
| Deployment Mechanism | Helm chart (helm upgrade --install) | gcloud ai models upload and gcloud ai endpoints deploy-model |
| Container Image Source | Any registry (nvcr.io, GitLab, Artifact Registry) | Artifact Registry only |
| Model Artifact Source | NGC, GCS, Hugging Face, local path | Google Cloud Storage (GCS) only |
| Authentication Methods | NGC key, Service Account, ADC, Workload Identity | Application Default Credentials (ADC) only |
| Scaling | Manual or Horizontal Pod Autoscaler (HPA) | Built-in automatic autoscaling |
| Cost Model | Node-based (cost incurred even when idle) | Consumption-based with scale-to-zero support |
| Ideal Use Case | Full control, custom environments, development | Quick deployment, managed infrastructure, production |
GKE Deployment#
To deploy NIM LLM on Google Kubernetes Engine (GKE), create a GKE cluster with GPU-enabled node pools and prepare your Google Cloud environment for NIM LLM workloads. This section covers prerequisites, cluster creation, and initial setup.
Prerequisites#
Before proceeding, ensure you have the following:
Google Cloud Project: A GCP project with permissions for creating GKE clusters and GPU node pools
GPU Quota: Sufficient GPU quota in your target region for your chosen GPU type (check in Cloud Console under Quotas)
Required Tools:
Google Cloud SDK
kubectl (install with gcloud components install kubectl)
Helm 3
gke-gcloud-auth-plugin
Initial Setup#
Complete the following setup steps before you deploy NIM on GKE.
Authenticate and Enable APIs#
Authenticate to Google Cloud, select your project, and enable the required services.
Authenticate with the Google Cloud CLI:
gcloud auth login
Set the active project:
gcloud config set project ${YOUR_PROJECT_ID}
Authenticate Application Default Credentials:
gcloud auth application-default login
Enable the required APIs:
gcloud services enable container.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable artifactregistry.googleapis.com
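Optionally, confirm that a service is active before you continue:
gcloud services list --enabled --filter="name:container.googleapis.com"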
Set Environment Variables#
Set environment variables for your GCP project, region, and NIM deployment. Update the placeholder values with your own:
export GCP_PROJECT="${YOUR_PROJECT_ID}"
export GCP_REGION="${YOUR_REGION}" # for example, europe-west4, us-central1
export GCP_ZONE="${YOUR_ZONE}" # for example, europe-west4-a
export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export GCS_BUCKET="${YOUR_BUCKET_NAME}"
export GKE_CLUSTER="${YOUR_CLUSTER_NAME}"
export NODE_POOL="gpu-pool"
export NAMESPACE="default"
Create GKE Cluster#
Create the network and GKE control plane before you add a GPU node pool for NIM LLM workloads.
Create VPC Network#
Create a dedicated VPC network for the GKE cluster:
gcloud compute networks create nim-network --subnet-mode=auto
Create Cluster#
Create the GKE cluster in that network:
gcloud container clusters create $GKE_CLUSTER \
--project $GCP_PROJECT \
--zone $GCP_ZONE \
--network=nim-network \
--subnetwork=nim-network \
--machine-type e2-standard-4 \
--num-nodes 1 \
--release-channel regular \
--image-type COS_CONTAINERD
Add GPU Node Pool#
NIM requires GPU nodes. First, check available GPU types in your zone:
gcloud compute accelerator-types list --filter="zone:$GCP_ZONE" --format="table(name,zone)"
Select a GPU type based on your model size:
| GPU | Machine Type | GPU Memory | Target Model Size |
|---|---|---|---|
| NVIDIA L4 | g2-standard-8 | 24 GB | Up to 8B parameters |
| NVIDIA A100 40GB | a2-highgpu-1g | 40 GB | 13B to 30B parameters |
| NVIDIA A100 80GB | a2-ultragpu-1g | 80 GB | Up to 70B parameters |
| NVIDIA H100 80GB | a3-highgpu-8g | 8 x 80 GB | Up to 405B parameters |
Create the GPU node pool:
gcloud container node-pools create $NODE_POOL \
--cluster $GKE_CLUSTER \
--project $GCP_PROJECT \
--zone $GCP_ZONE \
--machine-type g2-standard-8 \
--accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
--num-nodes 1 \
--image-type COS_CONTAINERD \
--scopes cloud-platform \
--disk-size 200
Note
Setting gpu-driver-version=latest triggers automatic NVIDIA driver installation, eliminating the need for a separate DaemonSet.
Get Cluster Credentials#
This command fetches the cluster endpoint and authentication credentials from GCP and writes them to your local kubeconfig file (~/.kube/config).
gcloud container clusters get-credentials $GKE_CLUSTER \
--zone $GCP_ZONE --project $GCP_PROJECT
Afterward, kubectl commands target your new cluster.
Verify GPU Availability#
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,STATUS:.status.conditions[-1].type'
You should observe at least one node with a GPU count of 1.
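If the GPU column shows <none> for a GPU node, driver installation might still be in progress; you can watch until the count appears:
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' -w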
Create Secrets#
NGC Registry Secret#
kubectl create secret docker-registry ngc-secret \
--namespace $NAMESPACE \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password="$NGC_API_KEY"
kubectl create secret generic ngc-api \
--namespace $NAMESPACE \
--from-literal=NGC_API_KEY="$NGC_API_KEY"
Optional: Hugging Face Token for Gated Models#
kubectl create secret generic hf-token \
--namespace $NAMESPACE \
--from-literal=HF_TOKEN="${YOUR_HF_TOKEN}"
Reference in Helm with model.hfTokenSecret: hf-token.
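For example, with the chart path, release name, and values file used later in this section, you can pass the secret name on the command line instead of in a values file:
helm upgrade --install my-nim ./helm \
  --namespace $NAMESPACE \
  -f values-gke-prebuilt.yaml \
  --set model.hfTokenSecret=hf-token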
Helm Deployment#
Create one of the following values files, depending on whether you are deploying a model-specific image or a model-free (generic) image that serves a model stored in GCS.
Model-Specific NIM#
Use this option for NIM images built for a specific model, such as Llama 3.1 8B Instruct.
Create values-gke-prebuilt.yaml:
image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.1"
  pullPolicy: IfNotPresent
model:
  name: meta/llama3.1-8b-instruct
  ngcAPISecret: ngc-api
  nimCache: /model-store
  openaiPort: 8000
  logLevel: INFO
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
persistence:
  enabled: true
  size: 50Gi
  storageClass: premium-rwo
imagePullSecrets:
  - name: ngc-secret
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
service:
  type: LoadBalancer
  openaiPort: 8000
Model-Free NIM with GCS Storage#
Use this option to serve custom or fine-tuned models stored in GCS with standard directory structure.
Create values-gke-model-free.yaml:
image:
  repository: <NIM_LLM_MODEL_FREE_IMAGE>
  tag: "2.0.1"
  pullPolicy: IfNotPresent
model:
  name: my-custom-model
  ngcAPISecret: ngc-api
  nimCache: /model-store
  openaiPort: 8000
  logLevel: INFO
env:
  - name: NIM_MODEL_PATH
    value: "gs://${GCS_BUCKET}/my-org/my-model"
  - name: NIM_SERVED_MODEL_NAME
    value: "my-custom-model"
  - name: NIM_MAX_MODEL_LEN
    value: "4096"
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
persistence:
  enabled: true
  size: 50Gi
  storageClass: premium-rwo
imagePullSecrets:
  - name: ngc-secret
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
service:
  type: LoadBalancer
  openaiPort: 8000
Note
For NIM_MODEL_PATH with GCS, ensure your model is uploaded with standard directory structure, not percent-encoded keys. Refer to GCS Use Cases: Model-Free vs Repository Override for upload instructions.
Note
GKE pods using Workload Identity or node service accounts with GCS access do not require explicit credentials. For local testing with a service account key, add:
env:
  - name: GOOGLE_APPLICATION_CREDENTIALS
    value: "/credentials/sa.json"
volumeMounts:
  - name: gcp-credentials
    mountPath: /credentials
    readOnly: true
volumes:
  - name: gcp-credentials
    secret:
      secretName: gcp-sa-key
Install the Helm Release#
export RELEASE_NAME="my-nim"
helm upgrade --install $RELEASE_NAME ./helm \
--namespace $NAMESPACE \
-f values-gke-prebuilt.yaml \
--timeout 45m \
--wait
For the model-free image, use -f values-gke-model-free.yaml instead.
Monitor Deployment#
kubectl -n $NAMESPACE get pods -l "app.kubernetes.io/name=nim-llm" -w
Wait until the pod status is Running and ready.
Configure Network Access#
GCP firewall rules often block inbound traffic on non-standard ports. Create a firewall rule that allows the NIM API port:
gcloud compute firewall-rules create allow-nim-8000 \
--network=nim-network \
--allow=tcp:8000 \
--source-ranges=0.0.0.0/0 \
--description="Allow NIM API port 8000"
Test the GKE Deployment#
Get the Endpoint#
For LoadBalancer service, wait for the external IP to be assigned:
kubectl -n $NAMESPACE get svc -l "app.kubernetes.io/name=nim-llm" -w
# Wait until EXTERNAL-IP changes from <pending> to an IP address, then Ctrl+C
export NIM_IP=$(kubectl -n $NAMESPACE get svc -l "app.kubernetes.io/name=nim-llm" \
-o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
echo "NIM endpoint: http://$NIM_IP:8000"
For ClusterIP service, use port-forward:
kubectl -n $NAMESPACE port-forward svc/${RELEASE_NAME}-nim-llm 8000:8000
Health Check#
Confirm the NIM API is up and ready:
curl -s "http://$NIM_IP:8000/v1/health/ready"
Test Chat Completions#
Send a test chat completion request and verify inference is working:
curl -X POST "http://$NIM_IP:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{"model": "meta/llama3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 256}'
Run Helm Tests#
Run the suite of Helm chart tests for your release:
helm test $RELEASE_NAME -n $NAMESPACE
View Logs#
View real-time logs for your deployed NIM pods:
kubectl -n $NAMESPACE logs -l "app.kubernetes.io/name=nim-llm" -f
GKE Model Sourcing Options#
NIM on GKE supports multiple model sources:
| Source | Method | Environment Variable | GKE Support |
|---|---|---|---|
| NGC (built-in manifest) | Container downloads from NGC at startup | None (default behavior) | Yes |
| NGC direct | NIM generates manifest, downloads from NGC | NIM_MODEL_PATH with an ngc:// URI | Yes |
| NGC mirrored to GCS | Model files in GCS with percent-encoded keys | NIM_REPOSITORY_OVERRIDE | Yes (refer to Model Storage Options) |
| Hugging Face direct | NIM generates a manifest and downloads from Hugging Face | NIM_MODEL_PATH with an hf:// URI | Yes |
| Local path | Model files pre-loaded on PVC or hostPath | NIM_MODEL_PATH with a local path | Yes |
GKE Authentication#
GKE supports multiple authentication methods for GCS access:
| Method | Use Case | Configuration |
|---|---|---|
| Application Default Credentials (ADC) | Local development | Run gcloud auth application-default login |
| Workload Identity | Production GKE | Link a Kubernetes (K8s) service account to a GCP service account |
| Service Account JSON | CI/CD pipelines | Set GOOGLE_APPLICATION_CREDENTIALS to the key file path |
Note
Workload Identity is the recommended approach for production GKE deployments. It eliminates the need for credential files in pods.
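The exact binding depends on your cluster configuration. As a minimal sketch, assuming the cluster was created with Workload Identity enabled (--workload-pool=${GCP_PROJECT}.svc.id.goog) and using an illustrative GCP service account named nim-gcs-reader mapped to the default Kubernetes service account:
gcloud iam service-accounts create nim-gcs-reader --project $GCP_PROJECT
gsutil iam ch \
  serviceAccount:nim-gcs-reader@${GCP_PROJECT}.iam.gserviceaccount.com:objectViewer \
  gs://$GCS_BUCKET
gcloud iam service-accounts add-iam-policy-binding \
  nim-gcs-reader@${GCP_PROJECT}.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:${GCP_PROJECT}.svc.id.goog[${NAMESPACE}/default]"
kubectl annotate serviceaccount default --namespace $NAMESPACE \
  iam.gke.io/gcp-service-account=nim-gcs-reader@${GCP_PROJECT}.iam.gserviceaccount.com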
GKE Cleanup#
Uninstall Helm Release#
helm uninstall $RELEASE_NAME -n $NAMESPACE
Delete PVCs#
kubectl delete pvc -n $NAMESPACE -l app.kubernetes.io/name=nim-llm
Delete GKE Cluster#
gcloud container clusters delete $GKE_CLUSTER --zone=$GCP_ZONE --quiet
Delete Firewall Rule#
gcloud compute firewall-rules delete allow-nim-8000 --quiet
Delete VPC Network#
gcloud compute networks delete nim-network --quiet
Vertex AI Deployment#
Vertex AI is a fully managed ML platform that eliminates the need for infrastructure management. It provides built-in load balancing, auto-scaling, and authentication through Application Default Credentials (ADC).
Key Constraints#
Vertex AI has specific requirements that differ from GKE deployments:
| Aspect | Requirement |
|---|---|
| Container Images | Must be sourced from Artifact Registry (direct pulls from nvcr.io or GitLab are not supported) |
| Model Storage | Must be stored in Google Cloud Storage (GCS) and referenced with the NIM_REPOSITORY_OVERRIDE or NIM_MODEL_PATH environment variable |
| Authentication | Uses Application Default Credentials (ADC) automatically |
Prerequisites#
Before proceeding, ensure you have the following:
GCP Project: A project with permissions for Vertex AI, Artifact Registry, and GCS
GPU Quota: Sufficient GPU quota in your region for your chosen GPU type (check in Cloud Console under Quotas)
Required Tools:
Google Cloud SDK
Docker (for pushing images to Artifact Registry)
Optional:
pip install google-cloud-aiplatform for Python SDK access
Initial Setup#
Complete the following setup steps before you deploy NIM on Vertex AI.
Authenticate and Enable APIs#
Authenticate to Google Cloud, select your project, and enable the required services.
Authenticate with the Google Cloud CLI:
gcloud auth login
Set the active project:
gcloud config set project ${YOUR_PROJECT_ID}
Authenticate Application Default Credentials:
gcloud auth application-default login
Enable the required APIs:
gcloud services enable aiplatform.googleapis.com
gcloud services enable artifactregistry.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable compute.googleapis.com
Note
The APIs must be enabled before you proceed. If you encounter errors indicating that an API is disabled, run the gcloud services enable commands above and wait one to two minutes.
Set Environment Variables#
export GCP_PROJECT="${YOUR_PROJECT_ID}"
export GCP_REGION="${YOUR_REGION}" # for example, europe-west4, us-central1
export GCS_BUCKET="${YOUR_GCS_BUCKET}" # for model storage
export AR_REPO="${YOUR_AR_REPO}" # for example, nim-repo
export AR_IMAGE="${GCP_REGION}-docker.pkg.dev/${GCP_PROJECT}/${AR_REPO}/nim-llm"
export AR_TAG="2.0.2"
Create GCS Bucket and Grant Access#
Create the GCS bucket and grant Vertex AI permission to read from it.
Create the bucket:
gsutil mb -l $GCP_REGION -p $GCP_PROJECT gs://$GCS_BUCKET
Create the Vertex AI service identity if this is your first Vertex AI deployment in the project:
gcloud beta services identity create --service=aiplatform.googleapis.com --project=$GCP_PROJECT
Get the project number:
PROJECT_NUMBER=$(gcloud projects describe $GCP_PROJECT --format="value(projectNumber)")
Grant the Vertex AI service agent access to the bucket:
gsutil iam ch \
  serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com:objectViewer \
  gs://$GCS_BUCKET
Note
If deployment fails with a permission error for custom-online-prediction@TENANT_ID-tp.iam.gserviceaccount.com, extract the TENANT_ID from the error and grant access:
export TENANT_ID=${TENANT_ID_FROM_ERROR}
gsutil iam ch \
serviceAccount:custom-online-prediction@${TENANT_ID}-tp.iam.gserviceaccount.com:objectViewer \
gs://$GCS_BUCKET
Create Artifact Registry Repository#
Create the Artifact Registry repository, grant Vertex AI access, and configure Docker authentication.
Create the repository:
gcloud artifacts repositories create $AR_REPO \
  --repository-format=docker \
  --location=$GCP_REGION \
  --project=$GCP_PROJECT
Get the project number:
PROJECT_NUMBER=$(gcloud projects describe $GCP_PROJECT --format="value(projectNumber)")
Grant the Vertex AI service agent read access to Artifact Registry:
gcloud projects add-iam-policy-binding $GCP_PROJECT \
  --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
Configure Docker authentication for Artifact Registry:
gcloud auth configure-docker ${GCP_REGION}-docker.pkg.dev
Push NIM Image to Artifact Registry#
Vertex AI cannot pull images from external registries. You must re-tag and push the NIM image to Artifact Registry:
Log in to NGC:
docker login nvcr.io
Pull the NIM image from NGC:
docker pull ${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1
Re-tag the image for Artifact Registry:
docker tag ${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1 ${AR_IMAGE}:${AR_TAG}
Push the image to Artifact Registry:
docker push ${AR_IMAGE}:${AR_TAG}
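Optionally, verify that the image is now available in the repository:
gcloud artifacts docker images list ${GCP_REGION}-docker.pkg.dev/${GCP_PROJECT}/${AR_REPO}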
Upload Model to GCS#
Models must be stored in GCS for Vertex AI. Refer to Model Storage Options for upload instructions.
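For the NIM_MODEL_PATH workflow, a standard recursive copy is sufficient; the local directory name here is illustrative and must match the gs:// path you register below:
gsutil -m cp -r ./llama-3.1-8b gs://${GCS_BUCKET}/llama-3.1-8b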
Register Model in Vertex AI#
For NIM_REPOSITORY_OVERRIDE#
gcloud ai models upload --region=$GCP_REGION \
--display-name=nim-llm-llama-8b \
--container-image-uri=${AR_IMAGE}:${AR_TAG} \
--container-ports=8000 \
--container-health-route=/v1/health/ready \
--container-predict-route=/v1/chat/completions \
--container-env-vars="NIM_REPOSITORY_OVERRIDE=gs://${GCS_BUCKET}"
For NIM_MODEL_PATH#
gcloud ai models upload --region=$GCP_REGION \
--display-name=nim-llm-llama-8b \
--container-image-uri=${AR_IMAGE}:${AR_TAG} \
--container-ports=8000 \
--container-health-route=/v1/health/ready \
--container-predict-route=/v1/chat/completions \
--container-env-vars="NIM_MODEL_PATH=gs://${GCS_BUCKET}/llama-3.1-8b"
Note
For NIM_MODEL_PATH, use a generic model-free image. Model-specific images might fail with “Model URI missing version suffix”.
Create Endpoint and Deploy#
Important
NIM LLM requires a GPU driver that supports CUDA 13.0 or later. If the default GPU driver on Vertex AI is too old for your deployment, startup can fail with RuntimeError: The NVIDIA driver on your system is too old.
If you need to override the default GPU driver version, use gcloud beta ai endpoints deploy-model with the --min-gpu-driver-version flag. The GA gcloud ai endpoints deploy-model command does not support this flag.
To create a Vertex AI endpoint and deploy your NIM model, complete the following steps in the same shell session:
Create the endpoint:
gcloud ai endpoints create --region=$GCP_REGION --display-name=nim-llm-endpoint
Get the endpoint and model IDs:
ENDPOINT_ID=$(gcloud ai endpoints list --region=$GCP_REGION \
  --filter="displayName:nim-llm-endpoint" --format="value(name)" | awk -F/ '{print $NF}')
MODEL_ID=$(gcloud ai models list --region=$GCP_REGION \
  --filter="displayName:nim-llm-llama-8b" --format="value(name)" | awk -F/ '{print $NF}')
Deploy the model to the endpoint. Use one of the following commands:
Standard workflow:
gcloud ai endpoints deploy-model $ENDPOINT_ID --region=$GCP_REGION \
  --model=$MODEL_ID \
  --display-name=nim-llm-v1 \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --min-replica-count=1 \
  --max-replica-count=3 \
  --traffic-split=0=100
Override the minimum GPU driver version:
gcloud beta ai endpoints deploy-model $ENDPOINT_ID --region=$GCP_REGION \
  --model=$MODEL_ID \
  --display-name=nim-llm-v1 \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --min-gpu-driver-version=580.65.06 \
  --min-replica-count=1 \
  --max-replica-count=3 \
  --traffic-split=0=100
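Deployment can take 20 minutes or longer while the model loads. Check progress at any time with:
gcloud ai endpoints describe $ENDPOINT_ID --region=$GCP_REGION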
Test the Vertex AI Endpoint#
Vertex AI endpoints are accessed through the Vertex AI REST API using rawPredict:
ENDPOINT_URL="https://${GCP_REGION}-aiplatform.googleapis.com/v1/projects/${GCP_PROJECT}/locations/${GCP_REGION}/endpoints/${ENDPOINT_ID}"
curl -X POST "${ENDPOINT_URL}:rawPredict" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{"model":"meta/llama3.1-8b-instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
Vertex AI Cleanup#
Undeploy and Delete Resources#
To programmatically clean up your deployed Vertex AI resources and remove related artifacts, you can use the following commands:
# Set your environment variables
export MODEL_DISPLAY_NAME="nim-llm-llama-8b"
export ENDPOINT_DISPLAY_NAME="nim-llm-endpoint"
# Find endpoint ID
ENDPOINT_ID=$(gcloud ai endpoints list --region=$GCP_REGION --project=$GCP_PROJECT \
--filter="displayName:${ENDPOINT_DISPLAY_NAME}" --format='value(name)' | head -1)
# Undeploy model from endpoint
if [[ -n "$ENDPOINT_ID" ]]; then
DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
--region=$GCP_REGION --project=$GCP_PROJECT \
--format='value(deployedModels.id)' | head -1)
if [[ -n "$DEPLOYED_MODEL_ID" ]]; then
gcloud ai endpoints undeploy-model $ENDPOINT_ID \
--region=$GCP_REGION \
--project=$GCP_PROJECT \
--deployed-model-id=$DEPLOYED_MODEL_ID \
--quiet
fi
fi
# Delete endpoint
if [[ -n "$ENDPOINT_ID" ]]; then
gcloud ai endpoints delete $ENDPOINT_ID \
--region=$GCP_REGION \
--project=$GCP_PROJECT \
--quiet
fi
# Find and delete model from registry
MODEL_ID=$(gcloud ai models list --region=$GCP_REGION --project=$GCP_PROJECT \
--filter="displayName:${MODEL_DISPLAY_NAME}" --format='value(name)' | head -1)
if [[ -n "$MODEL_ID" ]]; then
gcloud ai models delete $MODEL_ID \
--region=$GCP_REGION \
--project=$GCP_PROJECT \
--quiet
fi
# Delete Artifact Registry image
gcloud artifacts docker images delete "${AR_IMAGE}:${AR_TAG}" \
--project=$GCP_PROJECT \
--quiet
Optional: Delete Infrastructure#
To fully clean up your project resources—including removing the Artifact Registry repository and deleting all objects and the bucket itself from Google Cloud Storage—run the following commands:
# Delete Artifact Registry repository
gcloud artifacts repositories delete $AR_REPO \
--location=$GCP_REGION \
--project=$GCP_PROJECT \
--quiet
# Delete all objects in the GCS bucket and remove the bucket
gsutil -m rm -r "gs://${GCS_BUCKET}/**"
gsutil rb "gs://${GCS_BUCKET}"
GCS Use Cases: Model-Free vs Repository Override#
There are two distinct ways to serve models from GCS. They use different GCS key formats, different upload workflows, and serve fundamentally different purposes.
Use Case 1: NIM_REPOSITORY_OVERRIDE (NGC Mirror)#
Scenario: You have an NGC model with a built-in manifest containing BLAKE3 checksums, and you want to serve it from GCS instead of NGC at runtime. This is not model-free mode. The container ships with a model-specific manifest.
Upload: Use the upload-to-gcs.sh script. The script percent-encodes each NGC URI (without the ngc:// scheme) into a flat GCS key and uploads the files by using gsutil.
GCS key format: Flat percent-encoded keys. Each file becomes a single object with no directory structure:
nim%2Fmeta%2Fllama-3.1-8b-instruct%3Ahf%3Ffile%3Dconfig.json
nim%2Fmeta%2Fllama-3.1-8b-instruct%3Ahf%3Ffile%3Dmodel.safetensors
Download: At runtime, NIM LLM reads the built-in manifest (which still has ngc:// URIs), sees NIM_REPOSITORY_OVERRIDE=gs://bucket, percent-encodes each URI, rewrites it to gs://bucket/<encoded-key>, downloads from GCS, and verifies BLAKE3 checksums from the original manifest.
When to use: Air-gapped environments with no NGC access at runtime, enterprise GCS mirrors of NGC models, and latency-sensitive deployments where GCS is closer than NGC.
Step 1: Download model to local cache
First, download the model files from NGC to your local cache:
export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export NIM_CACHE_PATH=/tmp/nim-cache
docker run --rm \
-e NGC_API_KEY \
-e NIM_CACHE_PATH=/opt/nim/.cache \
-v ${NIM_CACHE_PATH}:/opt/nim/.cache \
${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1 \
download-to-cache --all
Step 2: Upload with percent encoding
Create an upload-to-gcs.sh script that handles percent-encoding:
#!/bin/bash
# Script: upload-to-gcs.sh
# Upload model files to GCS with percent-encoded keys for NIM_REPOSITORY_OVERRIDE.
#
# Usage: ./upload-to-gcs.sh /path/to/local/model/store gs://mybucket "org/model:version?file="
#
# IMPORTANT: MODEL_PREFIX must NOT include "ngc://". nim-sdk strips the scheme
# before encoding, so the GCS key must not contain it either.
# Correct: "nim/meta/llama-3.1-8b-instruct:hf?file="
# Incorrect: "ngc://nim/meta/llama-3.1-8b-instruct:hf?file="
set -euo pipefail
LOCAL_DIR=$1 # Directory containing model files
GCS_BUCKET=$2 # GCS bucket (for example, gs://mybucket)
MODEL_PREFIX=$3 # URI prefix WITHOUT ngc:// scheme (for example, nim/meta/model:v1?file=)
# Guard against accidental ngc:// inclusion
if [[ "$MODEL_PREFIX" == ngc://* ]]; then
echo "ERROR: MODEL_PREFIX must not include ngc:// -- nim-sdk strips the scheme."
echo " Got: $MODEL_PREFIX"
echo " Use: ${MODEL_PREFIX#ngc://}"
exit 1
fi
STAGING_DIR=$(mktemp -d)
trap "rm -rf $STAGING_DIR" EXIT
echo "Staging files with percent-encoded keys..."
find "$LOCAL_DIR" -type f | while read -r file; do
# Get relative path from local directory
rel_path="${file#$LOCAL_DIR/}"
# Build the full source URI
raw_key="${MODEL_PREFIX}${rel_path}"
# Percent-encode the key
encoded_key=$(python3 -c "import sys; from urllib.parse import quote; print(quote(sys.argv[1], safe=''))" "$raw_key")
# Stage the file with encoded name
mkdir -p "$(dirname "$STAGING_DIR/$encoded_key")"
cp "$file" "$STAGING_DIR/$encoded_key"
done
echo "Uploading all encoded files to $GCS_BUCKET ..."
gsutil -m cp -r "$STAGING_DIR/"* "$GCS_BUCKET/"
echo "Upload complete!"
Run the script:
chmod +x upload-to-gcs.sh
./upload-to-gcs.sh ${NIM_CACHE_PATH} gs://${GCS_BUCKET} "nim/meta/llama-3.1-8b-instruct:hf?file="
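You can confirm that the objects landed as flat, percent-encoded keys (note the %2F sequences in place of directory separators):
gsutil ls gs://${GCS_BUCKET} | head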
Important
The MODEL_PREFIX must not include the ngc:// scheme. nim-sdk strips the scheme
internally when computing GCS keys, so including it in the upload produces keys that
do not match what the download expects. To find the correct prefix, check the file URIs
in the container’s built-in manifest and strip ngc:// from the front.
Run NIM:
Note
For production deployments, use Workload Identity (GKE) or Application Default Credentials (ADC) instead of mounting service account key files. The example below uses a mounted key for local testing only.
docker run --gpus=all \
-v $(pwd)/local_cache:/opt/nim/.cache \
-e NIM_REPOSITORY_OVERRIDE=gs://my-bucket \
-e GOOGLE_APPLICATION_CREDENTIALS=/credentials/sa.json \
-v /path/to/sa.json:/credentials/sa.json:ro \
-p 8000:8000 \
${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1
Use Case 2: NIM_MODEL_PATH (Model-Free)#
Scenario: You have your own fine-tuned model already in GCS with normal directory structure. The model was never on NGC. This is model-free mode. The container is generic and serves whatever model you point it to.
This approach has the following characteristics:
Upload: Use any GCS tool with no special encoding:
gsutil cp -r ./my-model/ gs://my-bucket/my-org/my-model/
GCS key format: Standard directory-style keys, exactly as uploaded:
my-org/my-model/config.json my-org/my-model/model.safetensors my-org/my-model/tokenizer.json
Manifest generation: At startup, NIM LLM dynamically generates a model manifest based on the specified model.
When to use: Customer fine-tuned models in GCS, no NGC involvement, models uploaded with standard directory structure.
URI Format: gs://{bucket}/{path}/{model-name}
Example URI: gs://my-bucket/my-org/my-fine-tuned-model
Upload your model:
gsutil cp -r ./my-model/ gs://my-bucket/my-org/my-model/
Run NIM:
docker run --gpus=all \
-v $(pwd)/local_cache:/opt/nim/.cache \
-e NIM_MODEL_PATH=gs://my-bucket/my-org/my-model \
-e GOOGLE_APPLICATION_CREDENTIALS=/credentials/sa.json \
-v /path/to/sa.json:/credentials/sa.json:ro \
-p 8000:8000 \
${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1
For GKE or Vertex AI (using ADC, no credentials file needed):
docker run --gpus=all \
-v $(pwd)/local_cache:/opt/nim/.cache \
-e NIM_MODEL_PATH=gs://my-bucket/my-org/my-model \
-p 8000:8000 \
${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1
Comparison#
| Aspect | NIM_REPOSITORY_OVERRIDE | NIM_MODEL_PATH=gs:// |
|---|---|---|
| Purpose | Serve NGC model from GCS | Serve custom model from GCS |
| Model-free mode? | No (model-specific container) | Yes (generic container) |
| Model origin | NGC (has pre-built manifest + checksums) | Customer-uploaded (no manifest) |
| Upload tool | upload-to-gcs.sh script (percent-encodes keys) | Any GCS tool, such as gsutil |
| GCS key format | Percent-encoded flat keys | Normal directory structure |
| Manifest | Built into container at build time | Generated at runtime by nimlib |
| Manifest URI protocol | ngc:// (rewritten to gs:// at runtime) | gs:// |
| Checksums | BLAKE3 from NGC (pre-computed) | BLAKE3 computed during manifest generation |
| Double download? | No (checksums come from the manifest, so each file downloads once) | No (checksums are computed from the cached copy, which is then served) |
| NIM image type | Pre-built or generic | Generic only |
Important
These two approaches use incompatible key formats. Files uploaded with percent-encoding cannot be consumed by NIM_MODEL_PATH=gs:// (expects directory structure), and vice versa.
Authentication#
GCS supports Application Default Credentials (ADC). Authentication is resolved in the following order:
GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to a service account key file
Vertex AI managed environment credentials
Google Compute Engine (GCE) metadata service
Cloud Run environment
gcloud CLI user credentials
| Variable | Required | Purpose |
|---|---|---|
| GOOGLE_APPLICATION_CREDENTIALS | Yes (unless using ADC) | Path to service account JSON key file |
| | Only for GCS emulators | Custom endpoint for GCS emulators |
| | Only for GCS emulators | Enables anonymous access for GCS emulators |
Troubleshooting#
GKE Issues#
| Issue | Resolution |
|---|---|
| GKE API not enabled | Run gcloud services enable container.googleapis.com |
| Pod status: Pending | Verify GPU node pool exists, tolerations match, and PVC is bound |
| Pod status: ImagePullBackOff | Check imagePullSecrets and NGC_API_KEY validity |
| Pod status: CrashLoopBackOff | Check logs with kubectl logs |
| Startup probe failure | Model loading can take up to 30 minutes for large models. Increase the startup probe failureThreshold in your values file |
| No GPU detected | Verify that the node pool was created with gpu-driver-version=latest and that nvidia.com/gpu is allocatable on the GPU nodes |
| LoadBalancer stuck pending | IP assignment can take several minutes; if it stays pending, check GCP permissions |
| Port 8000 blocked | Create a firewall rule to allow TCP port 8000 |
Vertex AI Issues#
| Issue | Resolution |
|---|---|
| Vertex AI API not enabled | Run gcloud services enable aiplatform.googleapis.com |
| RuntimeError: The NVIDIA driver on your system is too old | The default Vertex AI GPU driver is too old for NIM LLM. Deploy using gcloud beta ai endpoints deploy-model with --min-gpu-driver-version |
| Deployment stuck | Model loading takes 20 to 45 minutes. Check operation status with gcloud ai operations describe |
| 401 error on GCS bucket | Grant the Vertex AI service account access to the bucket |
| Permission denied for custom-online-prediction | Grant the tenant project service account objectViewer access to the bucket |
| Image pull failed | Verify the image is in Artifact Registry and Vertex AI has roles/artifactregistry.reader |
| Model URI missing version suffix | Use a generic model-free image with NIM_MODEL_PATH |
For additional troubleshooting, refer to Support and FAQ.