Google Cloud#

NIM supports deployment on Google Cloud Platform (GCP) through two approaches:

  • Google Kubernetes Engine (GKE): Self-managed Kubernetes deployment using Helm charts with complete control over infrastructure, GPU node pools, and persistent storage.

  • Vertex AI: Fully managed ML platform with built-in load balancing, auto-scaling, and automatic authentication.

Both deployment strategies can use Google Cloud Storage (GCS) as the backend for model storage. The following table compares GKE and Vertex AI across common deployment considerations:

| Aspect | GKE (Helm) | Vertex AI |
|---|---|---|
| Infrastructure Management | User manages cluster, nodes, and GPUs | Fully managed by Google |
| Deployment Mechanism | helm install command | gcloud ai models upload and gcloud ai endpoints deploy-model |
| Container Image Source | Any registry (nvcr.io, GitLab, Artifact Registry) | Artifact Registry only |
| Model Artifact Source | NGC, GCS, Hugging Face, local path | Google Cloud Storage (GCS) only |
| Authentication Methods | NGC key, Service Account, ADC, Workload Identity | Application Default Credentials (ADC) only |
| Scaling | Manual or Horizontal Pod Autoscaler (HPA) | Built-in automatic autoscaling |
| Cost Model | Node-based (cost incurred even when idle) | Consumption-based with scale-to-zero support |
| Ideal Use Case | Full control, custom environments, development | Quick deployment, managed infrastructure, production |

GKE Deployment#

To deploy NIM LLM on Google Kubernetes Engine (GKE), create a GKE cluster with GPU-enabled node pools and prepare your Google Cloud environment for NIM LLM workloads. This section covers prerequisites, cluster creation, and initial setup.

Prerequisites#

Before proceeding, ensure you have the following:

  • Google Cloud Project: A GCP project with permissions for creating GKE clusters and GPU node pools

  • GPU Quota: Sufficient GPU quota in your target region for your chosen GPU type (check in Cloud Console under Quotas)

  • Required Tools:

    • Google Cloud SDK

    • kubectl (install with gcloud components install kubectl)

    • Helm 3

    • gke-gcloud-auth-plugin

Initial Setup#

Complete the following setup steps before you deploy NIM on GKE.

Authenticate and Enable APIs#

Authenticate to Google Cloud, select your project, and enable the required services.

  1. Authenticate with the Google Cloud CLI:

    gcloud auth login
    
  2. Set the active project:

    gcloud config set project ${YOUR_PROJECT_ID}
    
  3. Authenticate Application Default Credentials:

    gcloud auth application-default login
    
  4. Enable the required APIs:

    gcloud services enable container.googleapis.com
    gcloud services enable compute.googleapis.com
    gcloud services enable storage.googleapis.com
    gcloud services enable artifactregistry.googleapis.com
    

Set Environment Variables#

Set environment variables for your GCP project, region, and NIM deployment. Update the placeholder values with your own:

export GCP_PROJECT="${YOUR_PROJECT_ID}"
export GCP_REGION="${YOUR_REGION}"        # for example, europe-west4, us-central1
export GCP_ZONE="${YOUR_ZONE}"            # for example, europe-west4-a
export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export GCS_BUCKET="${YOUR_BUCKET_NAME}"
export GKE_CLUSTER="${YOUR_CLUSTER_NAME}"
export NODE_POOL="gpu-pool"
export NAMESPACE="default"

Create GKE Cluster#

Create the network and GKE control plane before you add a GPU node pool for NIM LLM workloads.

Create VPC Network#

Create a dedicated VPC network for the GKE cluster:

gcloud compute networks create nim-network --subnet-mode=auto

Create Cluster#

Create the GKE cluster in that network:

gcloud container clusters create $GKE_CLUSTER \
  --project $GCP_PROJECT \
  --zone $GCP_ZONE \
  --network=nim-network \
  --subnetwork=nim-network \
  --machine-type e2-standard-4 \
  --num-nodes 1 \
  --release-channel regular \
  --image-type COS_CONTAINERD

Add GPU Node Pool#

NIM requires GPU nodes. First, check available GPU types in your zone:

gcloud compute accelerator-types list --filter="zone:$GCP_ZONE" --format="table(name,zone)"

Select a GPU type based on your model size:

| GPU | Machine Type | GPU Memory | Target Model Size |
|---|---|---|---|
| NVIDIA L4 | g2-standard-8 | 24 GB | Up to 8B parameters |
| NVIDIA A100 40GB | a2-highgpu-1g | 40 GB | 13B to 30B parameters |
| NVIDIA A100 80GB | a2-ultragpu-1g | 80 GB | Up to 70B parameters |
| NVIDIA H100 80GB | a3-highgpu-8g | 8 x 80 GB | Up to 405B parameters |

Create the GPU node pool:

gcloud container node-pools create $NODE_POOL \
  --cluster $GKE_CLUSTER \
  --project $GCP_PROJECT \
  --zone $GCP_ZONE \
  --machine-type g2-standard-8 \
  --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
  --num-nodes 1 \
  --image-type COS_CONTAINERD \
  --scopes cloud-platform \
  --disk-size 200

Note

Setting gpu-driver-version=latest triggers automatic NVIDIA driver installation, eliminating the need for a separate DaemonSet.

Get Cluster Credentials#

This command fetches the cluster endpoint and authentication credentials from GCP and writes them to your local kubeconfig file (~/.kube/config).

gcloud container clusters get-credentials $GKE_CLUSTER \
  --zone $GCP_ZONE --project $GCP_PROJECT

Afterward, kubectl commands target your new cluster.

Verify GPU Availability#

kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,STATUS:.status.conditions[-1].type'

You should observe at least one node with a GPU count of 1.

Create Secrets#

NGC Registry Secret#

kubectl create secret docker-registry ngc-secret \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"

Optional: Hugging Face Token for Gated Models#

kubectl create secret generic hf-token \
  --namespace $NAMESPACE \
  --from-literal=HF_TOKEN="${YOUR_HF_TOKEN}"

Reference the secret in your Helm values file by setting model.hfTokenSecret: hf-token.
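For example, assuming your chart uses the model keys shown elsewhere in this section (key names can vary by chart version), the values entry looks like this sketch:

```yaml
# Sketch: reference the Hugging Face token secret from the Helm values file.
model:
  name: meta/llama3.1-8b-instruct
  ngcAPISecret: ngc-api
  hfTokenSecret: hf-token   # Kubernetes secret created above
```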

Helm Deployment#

Create one of the following values files based on whether you are deploying a model-specific image, a generic image, or a model stored in GCS.

Model-Specific NIM#

Use this option for NIM images built for a specific model, such as Llama 3.1 8B Instruct.

Create values-gke-prebuilt.yaml:

image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.1"
  pullPolicy: IfNotPresent

model:
  name: meta/llama3.1-8b-instruct
  ngcAPISecret: ngc-api
  nimCache: /model-store
  openaiPort: 8000
  logLevel: INFO

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

persistence:
  enabled: true
  size: 50Gi
  storageClass: premium-rwo

imagePullSecrets:
  - name: ngc-secret

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

service:
  type: LoadBalancer
  openaiPort: 8000

Model-Free NIM with GCS Storage#

Use this option to serve custom or fine-tuned models stored in GCS with standard directory structure.

Create values-gke-model-free.yaml:

image:
  repository: <NIM_LLM_MODEL_FREE_IMAGE>
  tag: "2.0.1"
  pullPolicy: IfNotPresent

model:
  name: my-custom-model
  ngcAPISecret: ngc-api
  nimCache: /model-store
  openaiPort: 8000
  logLevel: INFO

env:
  - name: NIM_MODEL_PATH
    value: "gs://${GCS_BUCKET}/my-org/my-model"
  - name: NIM_SERVED_MODEL_NAME
    value: "my-custom-model"
  - name: NIM_MAX_MODEL_LEN
    value: "4096"

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

persistence:
  enabled: true
  size: 50Gi
  storageClass: premium-rwo

imagePullSecrets:
  - name: ngc-secret

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

service:
  type: LoadBalancer
  openaiPort: 8000

Note

For NIM_MODEL_PATH with GCS, ensure your model is uploaded with standard directory structure, not percent-encoded keys. Refer to GCS Use Cases: Model-Free vs Repository Override for upload instructions.

Note

GKE pods using Workload Identity or node service accounts with GCS access do not require explicit credentials. For local testing with a service account key, add:

env:
  - name: GOOGLE_APPLICATION_CREDENTIALS
    value: "/credentials/sa.json"
volumeMounts:
  - name: gcp-credentials
    mountPath: /credentials
    readOnly: true
volumes:
  - name: gcp-credentials
    secret:
      secretName: gcp-sa-key

Install the Helm Release#

export RELEASE_NAME="my-nim"

helm upgrade --install $RELEASE_NAME ./helm \
  --namespace $NAMESPACE \
  -f values-gke-prebuilt.yaml \
  --timeout 45m \
  --wait

For the generic image, use -f values-gke-vllm-oss.yaml instead; for a model stored in GCS, use -f values-gke-model-free.yaml.

Monitor Deployment#

kubectl -n $NAMESPACE get pods -l "app.kubernetes.io/name=nim-llm" -w

Wait until the pod status is Running and ready.

Configure Network Access#

VPC firewall rules often block non-standard ports. Create a firewall rule:

gcloud compute firewall-rules create allow-nim-8000 \
  --network=nim-network \
  --allow=tcp:8000 \
  --source-ranges=0.0.0.0/0 \
  --description="Allow NIM API port 8000"

Test the GKE Deployment#

Get the Endpoint#

For LoadBalancer service, wait for the external IP to be assigned:

kubectl -n $NAMESPACE get svc -l "app.kubernetes.io/name=nim-llm" -w
# Wait until EXTERNAL-IP changes from <pending> to an IP address, then Ctrl+C

export NIM_IP=$(kubectl -n $NAMESPACE get svc -l "app.kubernetes.io/name=nim-llm" \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
echo "NIM endpoint: http://$NIM_IP:8000"

For ClusterIP service, use port-forward:

kubectl -n $NAMESPACE port-forward svc/${RELEASE_NAME}-nim-llm 8000:8000

Health Check#

Confirm the NIM API is up and ready:

curl -s "http://$NIM_IP:8000/v1/health/ready"

Test Chat Completions#

Send a test chat completion request and verify inference is working:

curl -X POST "http://$NIM_IP:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 256}'
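The response follows the OpenAI-compatible schema, so the reply text sits at choices[0].message.content. The snippet below sketches extracting it; the JSON is a hand-written truncated sample, not captured output:

```shell
# Illustrative: pull the assistant reply out of an OpenAI-compatible response.
# The JSON below is a hand-written sample, not real server output.
response='{"choices":[{"message":{"role":"assistant","content":"Hello there!"}}]}'
reply=$(printf '%s' "$response" | python3 -c "import sys, json; print(json.load(sys.stdin)['choices'][0]['message']['content'])")
echo "$reply"
```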

Run Helm Tests#

Run the suite of Helm chart tests for your release:

helm test $RELEASE_NAME -n $NAMESPACE

View Logs#

View real-time logs for your deployed NIM pods:

kubectl -n $NAMESPACE logs -l "app.kubernetes.io/name=nim-llm" -f

GKE Model Sourcing Options#

NIM on GKE supports multiple model sources:

| Source | Method | Environment Variable | GKE Support |
|---|---|---|---|
| NGC (built-in manifest) | Container downloads from NGC at startup | NGC_API_KEY (K8s secret) | Yes |
| NGC direct | NIM generates manifest, downloads from NGC | NIM_MODEL_PATH=ngc://org/model:ver | Yes |
| NGC mirrored to GCS | Model files in GCS with percent-encoded keys | NIM_REPOSITORY_OVERRIDE=gs://bucket | Yes (refer to Model Storage Options) |
| Hugging Face direct | NIM generates manifest, downloads from Hugging Face | NIM_MODEL_PATH=hf://org/model | Yes |
| Local path | Model files pre-loaded on PVC or hostPath | NIM_MODEL_PATH=/opt/nim/models/ | Yes |

GKE Authentication#

GKE supports multiple authentication methods for GCS access:

| Method | Use Case | Configuration |
|---|---|---|
| Application Default Credentials (ADC) | Local development | Run gcloud auth application-default login |
| Workload Identity | Production GKE | Link a Kubernetes (K8s) service account to a GCP service account |
| Service Account JSON | CI/CD pipelines | Set GOOGLE_APPLICATION_CREDENTIALS |

Note

Workload Identity is the recommended approach for production GKE deployments. It eliminates the need for credential files in pods.

GKE Cleanup#

Uninstall Helm Release#

helm uninstall $RELEASE_NAME -n $NAMESPACE

Delete PVCs#

kubectl delete pvc -n $NAMESPACE -l app.kubernetes.io/name=nim-llm

Delete GKE Cluster#

gcloud container clusters delete $GKE_CLUSTER --zone=$GCP_ZONE --quiet

Delete Firewall Rule#

gcloud compute firewall-rules delete allow-nim-8000 --quiet

Delete VPC Network#

gcloud compute networks delete nim-network --quiet

Vertex AI Deployment#

Vertex AI is a fully managed ML platform that eliminates the need for infrastructure management. It provides built-in load balancing, auto-scaling, and authentication through Application Default Credentials (ADC).

Key Constraints#

Vertex AI has specific requirements that differ from GKE deployments:

| Aspect | Requirement |
|---|---|
| Container Images | Must be sourced from Artifact Registry (direct pulls from nvcr.io or GitLab are not supported) |
| Model Storage | Must be stored in Google Cloud Storage (GCS) using the gs:// protocol |
| Authentication | Uses Application Default Credentials (ADC) automatically |

Prerequisites#

Before proceeding, ensure you have the following:

  • GCP Project: A project with permissions for Vertex AI, Artifact Registry, and GCS

  • GPU Quota: Sufficient GPU quota in your region for your chosen GPU type (check in Cloud Console under Quotas)

  • Required Tools:

    • Google Cloud SDK

    • Docker (for pushing images to Artifact Registry)

    • Optional: pip install google-cloud-aiplatform for Python SDK access

Initial Setup#

Complete the following setup steps before you deploy NIM on Vertex AI.

Authenticate and Enable APIs#

Authenticate to Google Cloud, select your project, and enable the required services.

  1. Authenticate with the Google Cloud CLI:

    gcloud auth login
    
  2. Set the active project:

    gcloud config set project ${YOUR_PROJECT_ID}
    
  3. Authenticate Application Default Credentials:

    gcloud auth application-default login
    
  4. Enable the required APIs:

    gcloud services enable aiplatform.googleapis.com
    gcloud services enable artifactregistry.googleapis.com
    gcloud services enable storage.googleapis.com
    gcloud services enable compute.googleapis.com
    

Note

The APIs must be enabled before you proceed. If you encounter errors indicating that an API is disabled, run the gcloud services enable commands and wait one to two minutes.

Set Environment Variables#

export GCP_PROJECT="${YOUR_PROJECT_ID}"
export GCP_REGION="${YOUR_REGION}"              # for example, europe-west4, us-central1
export GCS_BUCKET="${YOUR_GCS_BUCKET}"          # for model storage
export AR_REPO="${YOUR_AR_REPO}" # for example, nim-repo
export AR_IMAGE="${GCP_REGION}-docker.pkg.dev/${GCP_PROJECT}/${AR_REPO}/nim-llm"
export AR_TAG="2.0.1"
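AR_IMAGE composes the standard Artifact Registry image path, REGION-docker.pkg.dev/PROJECT/REPO/IMAGE. An illustration of the expansion with example values (substitute your own):

```shell
# Illustrative expansion of the Artifact Registry image path.
# Example values only; use your own project, region, and repository.
GCP_REGION="europe-west4"
GCP_PROJECT="my-project"
AR_REPO="nim-repo"
AR_IMAGE="${GCP_REGION}-docker.pkg.dev/${GCP_PROJECT}/${AR_REPO}/nim-llm"
echo "$AR_IMAGE"   # europe-west4-docker.pkg.dev/my-project/nim-repo/nim-llm
```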

Create GCS Bucket and Grant Access#

Create the GCS bucket and grant Vertex AI permission to read from it.

  1. Create the bucket:

    gsutil mb -l $GCP_REGION -p $GCP_PROJECT gs://$GCS_BUCKET
    
  2. Create the Vertex AI service identity if this is your first Vertex AI deployment in the project:

    gcloud beta services identity create --service=aiplatform.googleapis.com --project=$GCP_PROJECT
    
  3. Get the project number:

    PROJECT_NUMBER=$(gcloud projects describe $GCP_PROJECT --format="value(projectNumber)")
    
  4. Grant the Vertex AI service agent access to the bucket:

    gsutil iam ch \
      serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com:objectViewer \
      gs://$GCS_BUCKET
    

Note

If deployment fails with a permission error for custom-online-prediction@TENANT_ID-tp.iam.gserviceaccount.com, extract the TENANT_ID from the error and grant access:

export TENANT_ID=${TENANT_ID_FROM_ERROR}
gsutil iam ch \
  serviceAccount:custom-online-prediction@${TENANT_ID}-tp.iam.gserviceaccount.com:objectViewer \
  gs://$GCS_BUCKET

Create Artifact Registry Repository#

Create the Artifact Registry repository, grant Vertex AI access, and configure Docker authentication.

  1. Create the repository:

    gcloud artifacts repositories create $AR_REPO \
      --repository-format=docker \
      --location=$GCP_REGION \
      --project=$GCP_PROJECT
    
  2. Get the project number:

    PROJECT_NUMBER=$(gcloud projects describe $GCP_PROJECT --format="value(projectNumber)")
    
  3. Grant the Vertex AI service agent read access to Artifact Registry:

    gcloud projects add-iam-policy-binding $GCP_PROJECT \
      --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com" \
      --role="roles/artifactregistry.reader"
    
  4. Configure Docker authentication for Artifact Registry:

    gcloud auth configure-docker ${GCP_REGION}-docker.pkg.dev
    

Push NIM Image to Artifact Registry#

Vertex AI cannot pull images from external registries. You must re-tag and push the NIM image to Artifact Registry:

  1. Log in to NGC:

    docker login nvcr.io
    
  2. Pull the NIM image from NGC:

    docker pull ${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1
    
  3. Re-tag the image for Artifact Registry:

    docker tag ${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1 ${AR_IMAGE}:${AR_TAG}
    
  4. Push the image to Artifact Registry:

    docker push ${AR_IMAGE}:${AR_TAG}
    

Upload Model to GCS#

Models must be stored in GCS for Vertex AI. Refer to Model Storage Options for upload instructions.

Register Model in Vertex AI#

For NIM_REPOSITORY_OVERRIDE#

gcloud ai models upload --region=$GCP_REGION \
  --display-name=nim-llm-llama-8b \
  --container-image-uri=${AR_IMAGE}:${AR_TAG} \
  --container-ports=8000 \
  --container-health-route=/v1/health/ready \
  --container-predict-route=/v1/chat/completions \
  --container-env-vars="NIM_REPOSITORY_OVERRIDE=gs://${GCS_BUCKET}"

For NIM_MODEL_PATH#

gcloud ai models upload --region=$GCP_REGION \
  --display-name=nim-llm-llama-8b \
  --container-image-uri=${AR_IMAGE}:${AR_TAG} \
  --container-ports=8000 \
  --container-health-route=/v1/health/ready \
  --container-predict-route=/v1/chat/completions \
  --container-env-vars="NIM_MODEL_PATH=gs://${GCS_BUCKET}/llama-3.1-8b"

Note

For NIM_MODEL_PATH, use a generic model-free image. Model-specific images might fail with “Model URI missing version suffix”.

Create Endpoint and Deploy#

Important

NIM LLM requires a GPU driver that supports CUDA 13.0 or later. If the default GPU driver on Vertex AI is too old for your deployment, startup can fail with RuntimeError: The NVIDIA driver on your system is too old.

If you need to override the default GPU driver version, use gcloud beta ai endpoints deploy-model with the --min-gpu-driver-version flag. The GA gcloud ai endpoints deploy-model command does not support this flag.

To create a Vertex AI endpoint and deploy your NIM model, complete the following steps in the same shell session:

  1. Create the endpoint:

    gcloud ai endpoints create --region=$GCP_REGION --display-name=nim-llm-endpoint
    
  2. Get the endpoint and model IDs:

    ENDPOINT_ID=$(gcloud ai endpoints list --region=$GCP_REGION \
      --filter="displayName:nim-llm-endpoint" --format="value(name)" | awk -F/ '{print $NF}')
    MODEL_ID=$(gcloud ai models list --region=$GCP_REGION \
      --filter="displayName:nim-llm-llama-8b" --format="value(name)" | awk -F/ '{print $NF}')
    
  3. Deploy the model to the endpoint. Use one of the following commands:

    • Standard workflow:

      gcloud ai endpoints deploy-model $ENDPOINT_ID --region=$GCP_REGION \
        --model=$MODEL_ID \
        --display-name=nim-llm-v1 \
        --machine-type=g2-standard-8 \
        --accelerator=type=nvidia-l4,count=1 \
        --min-replica-count=1 \
        --max-replica-count=3 \
        --traffic-split=0=100
      
    • Override the minimum GPU driver version:

      gcloud beta ai endpoints deploy-model $ENDPOINT_ID --region=$GCP_REGION \
        --model=$MODEL_ID \
        --display-name=nim-llm-v1 \
        --machine-type=g2-standard-8 \
        --accelerator=type=nvidia-l4,count=1 \
        --min-gpu-driver-version=580.65.06 \
        --min-replica-count=1 \
        --max-replica-count=3 \
        --traffic-split=0=100
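The ID extraction in step 2 works because gcloud with --format="value(name)" returns full resource names, and awk -F/ keeps only the final path segment. An illustration with a made-up resource name:

```shell
# Illustrative: gcloud --format="value(name)" returns a full resource name;
# awk -F/ '{print $NF}' keeps only the trailing numeric ID.
name="projects/123456789/locations/us-central1/endpoints/9876543210"
ENDPOINT_ID=$(printf '%s\n' "$name" | awk -F/ '{print $NF}')
echo "$ENDPOINT_ID"   # 9876543210
```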
      

Test the Vertex AI Endpoint#

Vertex AI endpoints are accessed through the Vertex AI REST API using rawPredict:

ENDPOINT_URL="https://${GCP_REGION}-aiplatform.googleapis.com/v1/projects/${GCP_PROJECT}/locations/${GCP_REGION}/endpoints/${ENDPOINT_ID}"

curl -X POST "${ENDPOINT_URL}:rawPredict" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"model":"meta/llama3.1-8b-instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

Vertex AI Cleanup#

Undeploy and Delete Resources#

To programmatically clean up your deployed Vertex AI resources and remove related artifacts, you can use the following commands:

# Set your environment variables
export MODEL_DISPLAY_NAME="nim-llm-llama-8b"
export ENDPOINT_DISPLAY_NAME="nim-llm-endpoint"

# Find endpoint ID
ENDPOINT_ID=$(gcloud ai endpoints list --region=$GCP_REGION --project=$GCP_PROJECT \
  --filter="displayName:${ENDPOINT_DISPLAY_NAME}" --format='value(name)' | head -1)

# Undeploy model from endpoint
if [[ -n "$ENDPOINT_ID" ]]; then
  DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
    --region=$GCP_REGION --project=$GCP_PROJECT \
    --format='value(deployedModels.id)' | head -1)

  if [[ -n "$DEPLOYED_MODEL_ID" ]]; then
    gcloud ai endpoints undeploy-model $ENDPOINT_ID \
      --region=$GCP_REGION \
      --project=$GCP_PROJECT \
      --deployed-model-id=$DEPLOYED_MODEL_ID \
      --quiet
  fi
fi

# Delete endpoint
if [[ -n "$ENDPOINT_ID" ]]; then
  gcloud ai endpoints delete $ENDPOINT_ID \
    --region=$GCP_REGION \
    --project=$GCP_PROJECT \
    --quiet
fi

# Find and delete model from registry
MODEL_ID=$(gcloud ai models list --region=$GCP_REGION --project=$GCP_PROJECT \
  --filter="displayName:${MODEL_DISPLAY_NAME}" --format='value(name)' | head -1)

if [[ -n "$MODEL_ID" ]]; then
  gcloud ai models delete $MODEL_ID \
    --region=$GCP_REGION \
    --project=$GCP_PROJECT \
    --quiet
fi

# Delete Artifact Registry image
gcloud artifacts docker images delete "${AR_IMAGE}:${AR_TAG}" \
  --project=$GCP_PROJECT \
  --quiet

Optional: Delete Infrastructure#

To fully clean up your project resources—including removing the Artifact Registry repository and deleting all objects and the bucket itself from Google Cloud Storage—run the following commands:

# Delete Artifact Registry repository
gcloud artifacts repositories delete $AR_REPO \
  --location=$GCP_REGION \
  --project=$GCP_PROJECT \
  --quiet

# Delete all objects in the GCS bucket and remove the bucket
gsutil -m rm -r "gs://${GCS_BUCKET}/**"
gsutil rb "gs://${GCS_BUCKET}"

GCS Use Cases: Model-Free vs Repository Override#

There are two distinct ways to serve models from GCS. They use different GCS key formats, different upload workflows, and serve fundamentally different purposes.

Use Case 1: NIM_REPOSITORY_OVERRIDE (NGC Mirror)#

Scenario: You have an NGC model with a built-in manifest containing BLAKE3 checksums, and you want to serve it from GCS instead of NGC at runtime. This is not model-free mode. The container ships with a model-specific manifest.

  • Upload: Use the upload-to-gcs.sh script. The script percent-encodes each NGC URI (without the ngc:// scheme) into a flat GCS key, and uploads the files by using gsutil.

  • GCS key format: Flat percent-encoded keys. Each file becomes a single object with no directory structure:

    nim%2Fmeta%2Fllama-3.1-8b-instruct%3Ahf%3Ffile%3Dconfig.json
    nim%2Fmeta%2Fllama-3.1-8b-instruct%3Ahf%3Ffile%3Dmodel.safetensors
    
  • Download: At runtime, NIM LLM reads the built-in manifest (still has ngc:// URIs), sees NIM_REPOSITORY_OVERRIDE=gs://bucket, percent-encodes each URI, rewrites it to gs://bucket/<encoded-key>, downloads from GCS, and verifies BLAKE3 checksums from the original manifest.

  • When to use: Air-gapped environments with no NGC access at runtime, enterprise GCS mirrors of NGC models, and latency-sensitive deployments where GCS is closer than NGC.
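The key rewrite described above can be reproduced in a few lines of shell. The URI and bucket below are illustrative values, not real resources:

```shell
# Illustrative: reproduce the key rewrite NIM applies under NIM_REPOSITORY_OVERRIDE.
uri="ngc://nim/meta/llama-3.1-8b-instruct:hf?file=config.json"

# 1. Strip the ngc:// scheme.
stripped="${uri#ngc://}"

# 2. Percent-encode the remainder into a single flat key (no safe characters).
encoded=$(python3 -c 'import sys; from urllib.parse import quote; print(quote(sys.argv[1], safe=""))' "$stripped")

# 3. Prefix with the override bucket to get the object NIM downloads.
echo "gs://my-bucket/${encoded}"
```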

Step 1: Download model to local cache

First, download the model files from NGC to your local cache:

export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export NIM_CACHE_PATH=/tmp/nim-cache

docker run --rm \
  -e NGC_API_KEY \
  -e NIM_CACHE_PATH=/opt/nim/.cache \
  -v ${NIM_CACHE_PATH}:/opt/nim/.cache \
  ${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1 \
  download-to-cache --all

Step 2: Upload with percent encoding

Create an upload-to-gcs.sh script that handles percent-encoding:

#!/bin/bash
# Script: upload-to-gcs.sh
# Upload model files to GCS with percent-encoded keys for NIM_REPOSITORY_OVERRIDE.
#
# Usage: ./upload-to-gcs.sh /path/to/local/model/store gs://mybucket "org/model:version?file="
#
# IMPORTANT: MODEL_PREFIX must NOT include "ngc://". nim-sdk strips the scheme
# before encoding, so the GCS key must not contain it either.
# Correct:   "nim/meta/llama-3.1-8b-instruct:hf?file="
# Incorrect: "ngc://nim/meta/llama-3.1-8b-instruct:hf?file="

set -euo pipefail

LOCAL_DIR=$1        # Directory containing model files
GCS_BUCKET=$2       # GCS bucket (for example, gs://mybucket)
MODEL_PREFIX=$3     # URI prefix WITHOUT ngc:// scheme (for example, nim/meta/model:v1?file=)

# Guard against accidental ngc:// inclusion
if [[ "$MODEL_PREFIX" == ngc://* ]]; then
    echo "ERROR: MODEL_PREFIX must not include ngc:// -- nim-sdk strips the scheme."
    echo "  Got: $MODEL_PREFIX"
    echo "  Use: ${MODEL_PREFIX#ngc://}"
    exit 1
fi

STAGING_DIR=$(mktemp -d)
trap "rm -rf $STAGING_DIR" EXIT

echo "Staging files with percent-encoded keys..."
find "$LOCAL_DIR" -type f | while read -r file; do
    # Get relative path from local directory
    rel_path="${file#$LOCAL_DIR/}"

    # Build the full source URI
    raw_key="${MODEL_PREFIX}${rel_path}"

    # Percent-encode the key
    encoded_key=$(python3 -c "import sys; from urllib.parse import quote; print(quote(sys.argv[1], safe=''))" "$raw_key")

    # Stage the file with encoded name
    mkdir -p "$(dirname "$STAGING_DIR/$encoded_key")"
    cp "$file" "$STAGING_DIR/$encoded_key"
done

echo "Uploading all encoded files to $GCS_BUCKET ..."
gsutil -m cp -r "$STAGING_DIR/"* "$GCS_BUCKET/"

echo "Upload complete!"

Run the script:

chmod +x upload-to-gcs.sh
./upload-to-gcs.sh ${NIM_CACHE_PATH} gs://${GCS_BUCKET} "nim/meta/llama-3.1-8b-instruct:hf?file="

Important

The MODEL_PREFIX must not include the ngc:// scheme. nim-sdk strips the scheme internally when computing GCS keys, so including it in the upload produces keys that do not match what the download expects. To find the correct prefix, check the file URIs in the container’s built-in manifest and strip ngc:// from the front.

Run NIM:

Note

For production deployments, use Workload Identity (GKE) or Application Default Credentials (ADC) instead of mounting service account key files. The example below uses a mounted key for local testing only.

docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -e NIM_REPOSITORY_OVERRIDE=gs://my-bucket \
  -e GOOGLE_APPLICATION_CREDENTIALS=/credentials/sa.json \
  -v /path/to/sa.json:/credentials/sa.json:ro \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1

Use Case 2: NIM_MODEL_PATH (Model-Free)#

Scenario: You have your own fine-tuned model already in GCS with normal directory structure. The model was never on NGC. This is model-free mode. The container is generic and serves whatever model you point it to.

This approach has the following characteristics:

  • Upload: Use any GCS tool with no special encoding:

    gsutil cp -r ./my-model/ gs://my-bucket/my-org/my-model/
    
  • GCS key format: Standard directory-style keys, exactly as uploaded:

    my-org/my-model/config.json
    my-org/my-model/model.safetensors
    my-org/my-model/tokenizer.json
    
  • Manifest generation: At startup, NIM LLM dynamically generates a model manifest based on the specified model.

  • When to use: Customer fine-tuned models in GCS, no NGC involvement, models uploaded with standard directory structure.

URI Format: gs://{bucket}/{path}/{model-name}

Example URI: gs://my-bucket/my-org/my-fine-tuned-model
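For reference, the URI splits into a bucket name and an object prefix; the parsing below is illustrative only:

```shell
# Illustrative: decompose a gs:// model URI into bucket and object prefix.
uri="gs://my-bucket/my-org/my-fine-tuned-model"
path="${uri#gs://}"
bucket="${path%%/*}"      # text before the first slash
prefix="${path#*/}"       # everything after the first slash
echo "bucket=$bucket prefix=$prefix"
```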

Upload your model:

gsutil cp -r ./my-model/ gs://my-bucket/my-org/my-model/

Run NIM:

docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -e NIM_MODEL_PATH=gs://my-bucket/my-org/my-model \
  -e GOOGLE_APPLICATION_CREDENTIALS=/credentials/sa.json \
  -v /path/to/sa.json:/credentials/sa.json:ro \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1

For GKE or Vertex AI (using ADC, no credentials file needed):

docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -e NIM_MODEL_PATH=gs://my-bucket/my-org/my-model \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1

Comparison#

| Aspect | NIM_REPOSITORY_OVERRIDE | NIM_MODEL_PATH=gs:// |
|---|---|---|
| Purpose | Serve NGC model from GCS | Serve custom model from GCS |
| Model-free mode? | No (model-specific container) | Yes (generic container) |
| Model origin | NGC (has pre-built manifest + checksums) | Customer-uploaded (no manifest) |
| Upload tool | upload-to-gcs.sh or manual percent-encoding | gsutil cp or any tool |
| GCS key format | Percent-encoded flat keys | Normal directory structure |
| Manifest | Built into container at build time | Generated at runtime by nimlib |
| Manifest URI protocol | ngc:// (rewritten to gs:// at runtime) | gsrepo:// |
| Checksums | BLAKE3 from NGC (pre-computed) | BLAKE3 computed during manifest generation |
| Double download? | No (checksums come from the built-in manifest; files download once) | No (computed from cache and served from cache) |
| NIM image type | Pre-built or generic | Generic only |

Important

These two approaches use incompatible key formats. Files uploaded with percent-encoding cannot be consumed by NIM_MODEL_PATH=gs:// (expects directory structure), and vice versa.
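When auditing an existing bucket, a quick heuristic distinguishes the two layouts. This helper is hypothetical (not part of NIM): percent-encoded flat keys contain encoded separators such as %2F or %3A, while directory-style keys keep literal slashes.

```shell
# Hypothetical helper (not part of NIM): classify a GCS object key by layout.
key_layout() {
  case "$1" in
    *%2F*|*%3A*) echo "repository-override" ;;
    *)           echo "model-path" ;;
  esac
}

key_layout "nim%2Fmeta%2Fllama-3.1-8b-instruct%3Ahf%3Ffile%3Dconfig.json"  # repository-override
key_layout "my-org/my-model/config.json"                                   # model-path
```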

Authentication#

GCS supports Application Default Credentials (ADC). Authentication is resolved in the following order:

  1. GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to a service account key file

  2. Vertex AI managed environment credentials

  3. Google Compute Engine (GCE) metadata service

  4. Cloud Run environment

  5. gcloud CLI user credentials

The following environment variables control GCS access:

| Variable | Required | Purpose |
|---|---|---|
| GOOGLE_APPLICATION_CREDENTIALS | Yes (unless using ADC) | Path to service account JSON key file |
| GCS_ENDPOINT_URL | Only for GCS emulators | Custom endpoint (for example, http://localhost:4443 for fake-gcs) |
| STORAGE_EMULATOR_HOST | Only for GCS emulators | Enables anonymous access for GCS emulators |

Troubleshooting#

GKE Issues#

| Issue | Resolution |
|---|---|
| GKE API not enabled | Run gcloud services enable container.googleapis.com and wait 1-2 minutes |
| Pod status: Pending | Verify GPU node pool exists, tolerations match, and PVC is bound |
| Pod status: ImagePullBackOff | Check imagePullSecrets and NGC_API_KEY validity |
| Pod status: CrashLoopBackOff | Check logs with kubectl logs and verify environment variables |
| Startup probe failure | Model loading can take up to 30 minutes for large models; increase failureThreshold if needed |
| No GPU detected | Verify nvidia.com/gpu in node capacity, and ensure that gpu-driver-version=latest was set |
| LoadBalancer stuck pending | Check GCP permissions; assignment can take several minutes |
| Port 8000 blocked | Create a firewall rule to allow TCP port 8000 |

Vertex AI Issues#

| Issue | Resolution |
|---|---|
| Vertex AI API not enabled | Run gcloud services enable aiplatform.googleapis.com and wait 1-2 minutes |
| RuntimeError: The NVIDIA driver on your system is too old | The default Vertex AI GPU driver is too old for NIM LLM. Deploy using gcloud beta ai endpoints deploy-model with --min-gpu-driver-version=580.65.06 or later. Refer to Create Endpoint and Deploy. |
| Deployment stuck | Model loading takes 20 to 45 minutes. Check operation status with gcloud ai operations list --region=$GCP_REGION. |
| 401 error on GCS bucket | Grant the Vertex AI service account access to the bucket |
| Permission denied for custom-online-prediction | Grant the tenant project service account objectViewer access to the GCS bucket |
| Image pull failed | Verify the image is in Artifact Registry and Vertex AI has the artifactregistry.reader role |
| Model URI missing version suffix | Use a generic model-free image with NIM_MODEL_PATH instead of a model-specific image |

For additional troubleshooting, refer to Support and FAQ.

References#