Google Cloud#

NIM supports deployment on Google Cloud Platform (GCP) through two approaches:

  • Google Kubernetes Engine (GKE): Self-managed Kubernetes deployment using Helm charts with complete control over infrastructure, GPU node pools, and persistent storage.

  • Vertex AI: Fully managed ML platform with built-in load balancing, auto-scaling, and automatic authentication.

Both deployment strategies can use Google Cloud Storage (GCS) as the backend for model storage. The following table compares GKE and Vertex AI across common deployment considerations:

| Aspect | GKE (Helm) | Vertex AI |
|---|---|---|
| Infrastructure Management | User manages cluster, nodes, and GPUs | Fully managed by Google |
| Deployment Mechanism | helm install command | gcloud ai models upload and gcloud ai endpoints deploy-model |
| Container Image Source | Any registry (nvcr.io, GitLab, Artifact Registry) | Artifact Registry only |
| Model Artifact Source | NGC, GCS, Hugging Face, local path | Google Cloud Storage (GCS) only |
| Authentication Methods | NGC key, Service Account, ADC, Workload Identity | Application Default Credentials (ADC) only |
| Scaling | Manual or Horizontal Pod Autoscaler (HPA) | Built-in automatic autoscaling |
| Cost Model | Node-based (cost incurred even when idle) | Consumption-based with scale-to-zero support |
| Ideal Use Case | Full control, custom environments, development | Quick deployment, managed infrastructure, production |

GKE Deployment#

To deploy NIM LLM on Google Kubernetes Engine (GKE), create a GKE cluster with GPU-enabled node pools and prepare your Google Cloud environment for NIM LLM workloads. This section covers prerequisites, cluster creation, and initial setup.

Prerequisites#

Before proceeding, ensure you have the following:

  • Google Cloud Project: A GCP project with permissions for creating GKE clusters and GPU node pools

  • GPU Quota: Sufficient GPU quota in your target region for your chosen GPU type (check in Cloud Console under Quotas)

  • Required Tools:

    • Google Cloud SDK

    • kubectl (install with gcloud components install kubectl)

    • Helm 3

    • gke-gcloud-auth-plugin

Initial Setup#

Complete the following setup steps before you deploy NIM on GKE.

Authenticate and Enable APIs#

Authenticate to Google Cloud, select your project, and enable the required services.

  1. Authenticate with the Google Cloud CLI:

    gcloud auth login
    
  2. Set the active project:

    gcloud config set project ${YOUR_PROJECT_ID}
    
  3. Authenticate Application Default Credentials:

    gcloud auth application-default login
    
  4. Enable the required APIs:

    gcloud services enable container.googleapis.com
    gcloud services enable compute.googleapis.com
    gcloud services enable storage.googleapis.com
    gcloud services enable artifactregistry.googleapis.com
    

Set Environment Variables#

Set environment variables for your GCP project, region, and NIM deployment. Update the placeholder values with your own:

export GCP_PROJECT="${YOUR_PROJECT_ID}"
export GCP_REGION="${YOUR_REGION}"        # for example, europe-west4, us-central1
export GCP_ZONE="${YOUR_ZONE}"            # for example, europe-west4-a
export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export GCS_BUCKET="${YOUR_BUCKET_NAME}"
export GKE_CLUSTER="${YOUR_CLUSTER_NAME}"
export NODE_POOL="gpu-pool"
export NAMESPACE="default"

Create GKE Cluster#

Create the network and GKE control plane before you add a GPU node pool for NIM LLM workloads.

Create VPC Network#

Create a dedicated VPC network for the GKE cluster:

gcloud compute networks create nim-network --subnet-mode=auto

Create Cluster#

Create the GKE cluster in that network:

gcloud container clusters create $GKE_CLUSTER \
  --project $GCP_PROJECT \
  --zone $GCP_ZONE \
  --network=nim-network \
  --subnetwork=nim-network \
  --machine-type e2-standard-4 \
  --num-nodes 1 \
  --release-channel regular \
  --image-type COS_CONTAINERD

Add GPU Node Pool#

NIM requires GPU nodes. First, check available GPU types in your zone:

gcloud compute accelerator-types list --filter="zone:$GCP_ZONE" --format="table(name,zone)"

Select a GPU type based on your model size:

| GPU | Machine Type | GPU Memory | Target Model Size |
|---|---|---|---|
| NVIDIA L4 | g2-standard-8 | 24 GB | Up to 8B parameters |
| NVIDIA A100 40GB | a2-highgpu-1g | 40 GB | 13B to 30B parameters |
| NVIDIA A100 80GB | a2-ultragpu-1g | 80 GB | Up to 70B parameters |
| NVIDIA H100 80GB | a3-highgpu-8g | 8 x 80 GB | Up to 405B parameters |

Create the GPU node pool:

gcloud container node-pools create $NODE_POOL \
  --cluster $GKE_CLUSTER \
  --project $GCP_PROJECT \
  --zone $GCP_ZONE \
  --machine-type g2-standard-8 \
  --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
  --num-nodes 1 \
  --image-type COS_CONTAINERD \
  --scopes cloud-platform \
  --disk-size 200

Note

Setting gpu-driver-version=latest triggers automatic NVIDIA driver installation, eliminating the need for a separate DaemonSet.

Get Cluster Credentials#

This command fetches the cluster endpoint and authentication credentials from GCP and writes them to your local kubeconfig file (~/.kube/config).

gcloud container clusters get-credentials $GKE_CLUSTER \
  --zone $GCP_ZONE --project $GCP_PROJECT

Afterward, kubectl commands target your new cluster.

Verify GPU Availability#

kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,STATUS:.status.conditions[-1].type'

You should observe at least one node with a GPU count of 1.

Create Secrets#

NGC Registry Secret#

kubectl create secret docker-registry ngc-secret \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"

Optional: Hugging Face Token for Gated Models#

kubectl create secret generic hf-token \
  --namespace $NAMESPACE \
  --from-literal=HF_TOKEN="${YOUR_HF_TOKEN}"

Reference the secret in your Helm values file by setting model.hfTokenSecret: hf-token.
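For example, assuming your chart uses the model keys shown elsewhere in this section (key names can vary by chart version), the values entry looks like this sketch:

```yaml
# Sketch: reference the Hugging Face token secret from the Helm values file.
model:
  name: meta/llama3.1-8b-instruct
  ngcAPISecret: ngc-api
  hfTokenSecret: hf-token   # Kubernetes secret created above
```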

Helm Deployment#

Create one of the following values files based on whether you are deploying a model-specific image, a generic image, or a model stored in GCS.

Model-Specific NIM#

Use this option for NIM images built for a specific model, such as Llama 3.1 8B Instruct.

Create values-gke-prebuilt.yaml:

image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.1"
  pullPolicy: IfNotPresent

model:
  name: meta/llama3.1-8b-instruct
  ngcAPISecret: ngc-api
  nimCache: /model-store
  openaiPort: 8000
  logLevel: INFO

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

persistence:
  enabled: true
  size: 50Gi
  storageClass: premium-rwo

imagePullSecrets:
  - name: ngc-secret

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

service:
  type: LoadBalancer
  openaiPort: 8000

Model-Free NIM with GCS Storage#

Use this option to serve custom or fine-tuned models stored in GCS with standard directory structure.

Create values-gke-model-free.yaml:

image:
  repository: <NIM_LLM_MODEL_FREE_IMAGE>
  tag: "2.0.1"
  pullPolicy: IfNotPresent

model:
  name: my-custom-model
  ngcAPISecret: ngc-api
  nimCache: /model-store
  openaiPort: 8000
  logLevel: INFO

env:
  - name: NIM_MODEL_PATH
    value: "gs://${GCS_BUCKET}/my-org/my-model"
  - name: NIM_SERVED_MODEL_NAME
    value: "my-custom-model"
  - name: NIM_MAX_MODEL_LEN
    value: "4096"

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

persistence:
  enabled: true
  size: 50Gi
  storageClass: premium-rwo

imagePullSecrets:
  - name: ngc-secret

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

service:
  type: LoadBalancer
  openaiPort: 8000

Note

For NIM_MODEL_PATH with GCS, ensure your model is uploaded with standard directory structure, not percent-encoded keys. Refer to GCS Use Cases: Model-Free vs Repository Override for upload instructions.

Note

GKE pods using Workload Identity or node service accounts with GCS access do not require explicit credentials. For local testing with a service account key, add:

env:
  - name: GOOGLE_APPLICATION_CREDENTIALS
    value: "/credentials/sa.json"
volumeMounts:
  - name: gcp-credentials
    mountPath: /credentials
    readOnly: true
volumes:
  - name: gcp-credentials
    secret:
      secretName: gcp-sa-key

Install the Helm Release#

export RELEASE_NAME="my-nim"

helm upgrade --install $RELEASE_NAME ./helm \
  --namespace $NAMESPACE \
  -f values-gke-prebuilt.yaml \
  --timeout 45m \
  --wait

For the generic image, use -f values-gke-vllm-oss.yaml instead; for a model stored in GCS, use -f values-gke-model-free.yaml.

Monitor Deployment#

kubectl -n $NAMESPACE get pods -l "app.kubernetes.io/name=nim-llm" -w

Wait until the pod status is Running and ready.

Configure Network Access#

VPC firewall rules often block non-standard ports. Create a firewall rule:

gcloud compute firewall-rules create allow-nim-8000 \
  --network=nim-network \
  --allow=tcp:8000 \
  --source-ranges=0.0.0.0/0 \
  --description="Allow NIM API port 8000"

Test the GKE Deployment#

Get the Endpoint#

For LoadBalancer service, wait for the external IP to be assigned:

kubectl -n $NAMESPACE get svc -l "app.kubernetes.io/name=nim-llm" -w
# Wait until EXTERNAL-IP changes from <pending> to an IP address, then Ctrl+C

export NIM_IP=$(kubectl -n $NAMESPACE get svc -l "app.kubernetes.io/name=nim-llm" \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
echo "NIM endpoint: http://$NIM_IP:8000"

For ClusterIP service, use port-forward:

kubectl -n $NAMESPACE port-forward svc/${RELEASE_NAME}-nim-llm 8000:8000

Health Check#

Confirm the NIM API is up and ready:

curl -s "http://$NIM_IP:8000/v1/health/ready"

Test Chat Completions#

Send a test chat completion request and verify inference is working:

curl -X POST "http://$NIM_IP:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 256}'
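The response follows the OpenAI-compatible schema, so the reply text sits at choices[0].message.content. The snippet below sketches extracting it; the JSON is a hand-written truncated sample, not captured output:

```shell
# Illustrative: pull the assistant reply out of an OpenAI-compatible response.
# The JSON below is a hand-written sample, not real server output.
response='{"choices":[{"message":{"role":"assistant","content":"Hello there!"}}]}'
reply=$(printf '%s' "$response" | python3 -c "import sys, json; print(json.load(sys.stdin)['choices'][0]['message']['content'])")
echo "$reply"
```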

Run Helm Tests#

Run the suite of Helm chart tests for your release:

helm test $RELEASE_NAME -n $NAMESPACE

View Logs#

View real-time logs for your deployed NIM pods:

kubectl -n $NAMESPACE logs -l "app.kubernetes.io/name=nim-llm" -f

GKE Model Sourcing Options#

NIM on GKE supports multiple model sources:

| Source | Method | Environment Variable | GKE Support |
|---|---|---|---|
| NGC (built-in manifest) | Container downloads from NGC at startup | NGC_API_KEY (K8s secret) | Yes |
| NGC direct | NIM generates manifest, downloads from NGC | NIM_MODEL_PATH=ngc://org/model:ver | Yes |
| NGC mirrored to GCS | Model files in GCS with percent-encoded keys | NIM_REPOSITORY_OVERRIDE=gs://bucket | Yes (refer to Model Storage Options) |
| Hugging Face direct | NIM generates manifest, downloads from Hugging Face | NIM_MODEL_PATH=hf://org/model | Yes |
| Local path | Model files pre-loaded on PVC or hostPath | NIM_MODEL_PATH=/opt/nim/models/ | Yes |

GKE Authentication#

GKE supports multiple authentication methods for GCS access:

| Method | Use Case | Configuration |
|---|---|---|
| Application Default Credentials (ADC) | Local development | Run gcloud auth application-default login |
| Workload Identity | Production GKE | Link a Kubernetes (K8s) service account to a GCP service account |
| Service Account JSON | CI/CD pipelines | Set GOOGLE_APPLICATION_CREDENTIALS |

Note

Workload Identity is the recommended approach for production GKE deployments. It eliminates the need for credential files in pods.

GKE Cleanup#

Uninstall Helm Release#

helm uninstall $RELEASE_NAME -n $NAMESPACE

Delete PVCs#

kubectl delete pvc -n $NAMESPACE -l app.kubernetes.io/name=nim-llm

Delete GKE Cluster#

gcloud container clusters delete $GKE_CLUSTER --zone=$GCP_ZONE --quiet

Delete Firewall Rule#

gcloud compute firewall-rules delete allow-nim-8000 --quiet

Delete VPC Network#

gcloud compute networks delete nim-network --quiet

Vertex AI Deployment#

Vertex AI is a fully managed ML platform that eliminates the need for infrastructure management. It provides built-in load balancing, auto-scaling, and authentication through Application Default Credentials (ADC).

Key Constraints#

Vertex AI has specific requirements that differ from GKE deployments:

| Aspect | Requirement |
|---|---|
| Container Images | Must be sourced from Artifact Registry (direct pulls from nvcr.io or GitLab are not supported) |
| Model Storage | Must be stored in Google Cloud Storage (GCS) using the gs:// protocol |
| Authentication | Uses Application Default Credentials (ADC) automatically |

Prerequisites#

Before proceeding, ensure you have the following:

  • GCP Project: A project with permissions for Vertex AI, Artifact Registry, and GCS

  • GPU Quota: Sufficient GPU quota in your region for your chosen GPU type (check in Cloud Console under Quotas)

  • Required Tools:

    • Google Cloud SDK

    • Docker (for pushing images to Artifact Registry)

    • Optional: pip install google-cloud-aiplatform for Python SDK access

Initial Setup#

Complete the following setup steps before you deploy NIM on Vertex AI.

Authenticate and Enable APIs#

Authenticate to Google Cloud, select your project, and enable the required services.

  1. Authenticate with the Google Cloud CLI:

    gcloud auth login
    
  2. Set the active project:

    gcloud config set project ${YOUR_PROJECT_ID}
    
  3. Authenticate Application Default Credentials:

    gcloud auth application-default login
    
  4. Enable the required APIs:

    gcloud services enable aiplatform.googleapis.com
    gcloud services enable artifactregistry.googleapis.com
    gcloud services enable storage.googleapis.com
    gcloud services enable compute.googleapis.com
    

Note

The APIs must be enabled before you proceed. If you encounter errors indicating that an API is disabled, run the gcloud services enable commands and wait one to two minutes.

Set Environment Variables#

export GCP_PROJECT="${YOUR_PROJECT_ID}"
export GCP_REGION="${YOUR_REGION}"              # for example, europe-west4, us-central1
export GCS_BUCKET="${YOUR_GCS_BUCKET}"          # for model storage
export AR_REPO="${YOUR_AR_REPO}" # for example, nim-repo
export AR_IMAGE="${GCP_REGION}-docker.pkg.dev/${GCP_PROJECT}/${AR_REPO}/nim-llm"
export AR_TAG="2.0.1"
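AR_IMAGE composes the standard Artifact Registry image path, REGION-docker.pkg.dev/PROJECT/REPO/IMAGE. An illustration of the expansion with example values (substitute your own):

```shell
# Illustrative expansion of the Artifact Registry image path.
# Example values only; use your own project, region, and repository.
GCP_REGION="europe-west4"
GCP_PROJECT="my-project"
AR_REPO="nim-repo"
AR_IMAGE="${GCP_REGION}-docker.pkg.dev/${GCP_PROJECT}/${AR_REPO}/nim-llm"
echo "$AR_IMAGE"   # europe-west4-docker.pkg.dev/my-project/nim-repo/nim-llm
```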

Create GCS Bucket and Grant Access#

Create the GCS bucket and grant Vertex AI permission to read from it.

  1. Create the bucket:

    gsutil mb -l $GCP_REGION -p $GCP_PROJECT gs://$GCS_BUCKET
    
  2. Create the Vertex AI service identity if this is your first Vertex AI deployment in the project:

    gcloud beta services identity create --service=aiplatform.googleapis.com --project=$GCP_PROJECT
    
  3. Get the project number:

    PROJECT_NUMBER=$(gcloud projects describe $GCP_PROJECT --format="value(projectNumber)")
    
  4. Grant the Vertex AI service agent access to the bucket:

    gsutil iam ch \
      serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com:objectViewer \
      gs://$GCS_BUCKET
    

Note

If deployment fails with a permission error for custom-online-prediction@TENANT_ID-tp.iam.gserviceaccount.com, extract the TENANT_ID from the error and grant access:

export TENANT_ID=${TENANT_ID_FROM_ERROR}
gsutil iam ch \
  serviceAccount:custom-online-prediction@${TENANT_ID}-tp.iam.gserviceaccount.com:objectViewer \
  gs://$GCS_BUCKET

Create Artifact Registry Repository#

Create the Artifact Registry repository, grant Vertex AI access, and configure Docker authentication.

  1. Create the repository:

    gcloud artifacts repositories create $AR_REPO \
      --repository-format=docker \
      --location=$GCP_REGION \
      --project=$GCP_PROJECT
    
  2. Get the project number:

    PROJECT_NUMBER=$(gcloud projects describe $GCP_PROJECT --format="value(projectNumber)")
    
  3. Grant the Vertex AI service agent read access to Artifact Registry:

    gcloud projects add-iam-policy-binding $GCP_PROJECT \
      --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com" \
      --role="roles/artifactregistry.reader"
    
  4. Configure Docker authentication for Artifact Registry:

    gcloud auth configure-docker ${GCP_REGION}-docker.pkg.dev
    

Push NIM Image to Artifact Registry#

Vertex AI cannot pull images from external registries. You must re-tag and push the NIM image to Artifact Registry:

  1. Log in to NGC:

    docker login nvcr.io
    
  2. Pull the NIM image from NGC:

    docker pull ${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1
    
  3. Re-tag the image for Artifact Registry:

    docker tag ${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1 ${AR_IMAGE}:${AR_TAG}
    
  4. Push the image to Artifact Registry:

    docker push ${AR_IMAGE}:${AR_TAG}
    

Upload Model to GCS#

Models must be stored in GCS for Vertex AI. Refer to Model Storage Options for upload instructions.

Register Model in Vertex AI#

For NIM_REPOSITORY_OVERRIDE#

gcloud ai models upload --region=$GCP_REGION \
  --display-name=nim-llm-llama-8b \
  --container-image-uri=${AR_IMAGE}:${AR_TAG} \
  --container-ports=8000 \
  --container-health-route=/v1/health/ready \
  --container-predict-route=/v1/chat/completions \
  --container-env-vars="NIM_REPOSITORY_OVERRIDE=gs://${GCS_BUCKET}"

For NIM_MODEL_PATH#

gcloud ai models upload --region=$GCP_REGION \
  --display-name=nim-llm-llama-8b \
  --container-image-uri=${AR_IMAGE}:${AR_TAG} \
  --container-ports=8000 \
  --container-health-route=/v1/health/ready \
  --container-predict-route=/v1/chat/completions \
  --container-env-vars="NIM_MODEL_PATH=gs://${GCS_BUCKET}/llama-3.1-8b"

Note

For NIM_MODEL_PATH, use a generic model-free image. Model-specific images might fail with “Model URI missing version suffix”.

Create Endpoint and Deploy#

Important

NIM LLM requires a GPU driver that supports CUDA 13.0 or later. If the default GPU driver on Vertex AI is too old for your deployment, startup can fail with RuntimeError: The NVIDIA driver on your system is too old.

If you need to override the default GPU driver version, use gcloud beta ai endpoints deploy-model with the --min-gpu-driver-version flag. The GA gcloud ai endpoints deploy-model command does not support this flag.

To create a Vertex AI endpoint and deploy your NIM model, complete the following steps in the same shell session:

  1. Create the endpoint:

    gcloud ai endpoints create --region=$GCP_REGION --display-name=nim-llm-endpoint
    
  2. Get the endpoint and model IDs:

    ENDPOINT_ID=$(gcloud ai endpoints list --region=$GCP_REGION \
      --filter="displayName:nim-llm-endpoint" --format="value(name)" | awk -F/ '{print $NF}')
    MODEL_ID=$(gcloud ai models list --region=$GCP_REGION \
      --filter="displayName:nim-llm-llama-8b" --format="value(name)" | awk -F/ '{print $NF}')
    
  3. Deploy the model to the endpoint. Use one of the following commands:

    • Standard workflow:

      gcloud ai endpoints deploy-model $ENDPOINT_ID --region=$GCP_REGION \
        --model=$MODEL_ID \
        --display-name=nim-llm-v1 \
        --machine-type=g2-standard-8 \
        --accelerator=type=nvidia-l4,count=1 \
        --min-replica-count=1 \
        --max-replica-count=3 \
        --traffic-split=0=100
      
    • Override the minimum GPU driver version:

      gcloud beta ai endpoints deploy-model $ENDPOINT_ID --region=$GCP_REGION \
        --model=$MODEL_ID \
        --display-name=nim-llm-v1 \
        --machine-type=g2-standard-8 \
        --accelerator=type=nvidia-l4,count=1 \
        --min-gpu-driver-version=580.65.06 \
        --min-replica-count=1 \
        --max-replica-count=3 \
        --traffic-split=0=100
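The ID extraction in step 2 works because gcloud with --format="value(name)" returns full resource names, and awk -F/ keeps only the final path segment. An illustration with a made-up resource name:

```shell
# Illustrative: gcloud --format="value(name)" returns a full resource name;
# awk -F/ '{print $NF}' keeps only the trailing numeric ID.
name="projects/123456789/locations/us-central1/endpoints/9876543210"
ENDPOINT_ID=$(printf '%s\n' "$name" | awk -F/ '{print $NF}')
echo "$ENDPOINT_ID"   # 9876543210
```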
      

Test the Vertex AI Endpoint#

Vertex AI endpoints are accessed through the Vertex AI REST API using rawPredict:

ENDPOINT_URL="https://${GCP_REGION}-aiplatform.googleapis.com/v1/projects/${GCP_PROJECT}/locations/${GCP_REGION}/endpoints/${ENDPOINT_ID}"

curl -X POST "${ENDPOINT_URL}:rawPredict" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"model":"meta/llama3.1-8b-instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

Vertex AI Cleanup#

Undeploy and Delete Resources#

To programmatically clean up your deployed Vertex AI resources and remove related artifacts, you can use the following commands:

# Set your environment variables
export MODEL_DISPLAY_NAME="nim-llm-llama-8b"
export ENDPOINT_DISPLAY_NAME="nim-llm-endpoint"

# Find endpoint ID
ENDPOINT_ID=$(gcloud ai endpoints list --region=$GCP_REGION --project=$GCP_PROJECT \
  --filter="displayName:${ENDPOINT_DISPLAY_NAME}" --format='value(name)' | head -1)

# Undeploy model from endpoint
if [[ -n "$ENDPOINT_ID" ]]; then
  DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
    --region=$GCP_REGION --project=$GCP_PROJECT \
    --format='value(deployedModels.id)' | head -1)

  if [[ -n "$DEPLOYED_MODEL_ID" ]]; then
    gcloud ai endpoints undeploy-model $ENDPOINT_ID \
      --region=$GCP_REGION \
      --project=$GCP_PROJECT \
      --deployed-model-id=$DEPLOYED_MODEL_ID \
      --quiet
  fi
fi

# Delete endpoint
if [[ -n "$ENDPOINT_ID" ]]; then
  gcloud ai endpoints delete $ENDPOINT_ID \
    --region=$GCP_REGION \
    --project=$GCP_PROJECT \
    --quiet
fi

# Find and delete model from registry
MODEL_ID=$(gcloud ai models list --region=$GCP_REGION --project=$GCP_PROJECT \
  --filter="displayName:${MODEL_DISPLAY_NAME}" --format='value(name)' | head -1)

if [[ -n "$MODEL_ID" ]]; then
  gcloud ai models delete $MODEL_ID \
    --region=$GCP_REGION \
    --project=$GCP_PROJECT \
    --quiet
fi

# Delete Artifact Registry image
gcloud artifacts docker images delete "${AR_IMAGE}:${AR_TAG}" \
  --project=$GCP_PROJECT \
  --quiet

Optional: Delete Infrastructure#

To fully clean up your project resources—including removing the Artifact Registry repository and deleting all objects and the bucket itself from Google Cloud Storage—run the following commands:

# Delete Artifact Registry repository
gcloud artifacts repositories delete $AR_REPO \
  --location=$GCP_REGION \
  --project=$GCP_PROJECT \
  --quiet

# Delete all objects in the GCS bucket and remove the bucket
gsutil -m rm -r "gs://${GCS_BUCKET}/**"
gsutil rb "gs://${GCS_BUCKET}"

GCS Use Cases: Model-Free vs Repository Override#

There are two distinct ways to serve models from GCS. They use different GCS key formats, different upload workflows, and serve fundamentally different purposes.

Use Case 1: NIM_REPOSITORY_OVERRIDE (NGC Mirror)#

Scenario: You have an NGC model with a built-in manifest containing BLAKE3 checksums, and you want to serve it from GCS instead of NGC at runtime. This is not model-free mode. The container ships with a model-specific manifest.

  • Upload: Use the upload-to-gcs.sh script. The script percent-encodes each NGC URI (without the ngc:// scheme) into a flat GCS key, and uploads the files by using gsutil.

  • GCS key format: Flat percent-encoded keys. Each file becomes a single object with no directory structure:

    nim%2Fmeta%2Fllama-3.1-8b-instruct%3Ahf%3Ffile%3Dconfig.json
    nim%2Fmeta%2Fllama-3.1-8b-instruct%3Ahf%3Ffile%3Dmodel.safetensors
    
  • Download: At runtime, NIM LLM reads the built-in manifest (still has ngc:// URIs), sees NIM_REPOSITORY_OVERRIDE=gs://bucket, percent-encodes each URI, rewrites it to gs://bucket/<encoded-key>, downloads from GCS, and verifies BLAKE3 checksums from the original manifest.

  • When to use: Air-gapped environments with no NGC access at runtime, enterprise GCS mirrors of NGC models, and latency-sensitive deployments where GCS is closer than NGC.
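The key rewrite described above can be reproduced in a few lines of shell. The URI and bucket below are illustrative values, not real resources:

```shell
# Illustrative: reproduce the key rewrite NIM applies under NIM_REPOSITORY_OVERRIDE.
uri="ngc://nim/meta/llama-3.1-8b-instruct:hf?file=config.json"

# 1. Strip the ngc:// scheme.
stripped="${uri#ngc://}"

# 2. Percent-encode the remainder into a single flat key (no safe characters).
encoded=$(python3 -c 'import sys; from urllib.parse import quote; print(quote(sys.argv[1], safe=""))' "$stripped")

# 3. Prefix with the override bucket to get the object NIM downloads.
echo "gs://my-bucket/${encoded}"
```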

Step 1: Download model to local cache

First, download the model files from NGC to your local cache:

export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export NIM_CACHE_PATH=/tmp/nim-cache

docker run --rm \
  -e NGC_API_KEY \
  -e NIM_CACHE_PATH=/opt/nim/.cache \
  -v ${NIM_CACHE_PATH}:/opt/nim/.cache \
  ${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1 \
  download-to-cache --all

Step 2: Upload with percent encoding

Create an upload-to-gcs.sh script that handles percent-encoding:

#!/bin/bash
# Script: upload-to-gcs.sh
# Upload model files to GCS with percent-encoded keys for NIM_REPOSITORY_OVERRIDE.
#
# Usage: ./upload-to-gcs.sh /path/to/local/model/store gs://mybucket "org/model:version?file="
#
# IMPORTANT: MODEL_PREFIX must NOT include "ngc://". nim-sdk strips the scheme
# before encoding, so the GCS key must not contain it either.
# Correct:   "nim/meta/llama-3.1-8b-instruct:hf?file="
# Incorrect: "ngc://nim/meta/llama-3.1-8b-instruct:hf?file="

set -euo pipefail

LOCAL_DIR=$1        # Directory containing model files
GCS_BUCKET=$2       # GCS bucket (for example, gs://mybucket)
MODEL_PREFIX=$3     # URI prefix WITHOUT ngc:// scheme (for example, nim/meta/model:v1?file=)

# Guard against accidental ngc:// inclusion
if [[ "$MODEL_PREFIX" == ngc://* ]]; then
    echo "ERROR: MODEL_PREFIX must not include ngc:// -- nim-sdk strips the scheme."
    echo "  Got: $MODEL_PREFIX"
    echo "  Use: ${MODEL_PREFIX#ngc://}"
    exit 1
fi

STAGING_DIR=$(mktemp -d)
trap "rm -rf $STAGING_DIR" EXIT

echo "Staging files with percent-encoded keys..."
find "$LOCAL_DIR" -type f | while read -r file; do
    # Get relative path from local directory
    rel_path="${file#$LOCAL_DIR/}"

    # Build the full source URI
    raw_key="${MODEL_PREFIX}${rel_path}"

    # Percent-encode the key
    encoded_key=$(python3 -c "import sys; from urllib.parse import quote; print(quote(sys.argv[1], safe=''))" "$raw_key")

    # Stage the file with encoded name
    mkdir -p "$(dirname "$STAGING_DIR/$encoded_key")"
    cp "$file" "$STAGING_DIR/$encoded_key"
done

echo "Uploading all encoded files to $GCS_BUCKET ..."
gsutil -m cp -r "$STAGING_DIR/"* "$GCS_BUCKET/"

echo "Upload complete!"

Run the script:

chmod +x upload-to-gcs.sh
./upload-to-gcs.sh ${NIM_CACHE_PATH} gs://${GCS_BUCKET} "nim/meta/llama-3.1-8b-instruct:hf?file="

Important

The MODEL_PREFIX must not include the ngc:// scheme. nim-sdk strips the scheme internally when computing GCS keys, so including it in the upload produces keys that do not match what the download expects. To find the correct prefix, check the file URIs in the container’s built-in manifest and strip ngc:// from the front.

Run NIM:

Note

For production deployments, use Workload Identity (GKE) or Application Default Credentials (ADC) instead of mounting service account key files. The example below uses a mounted key for local testing only.

docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -e NIM_REPOSITORY_OVERRIDE=gs://my-bucket \
  -e GOOGLE_APPLICATION_CREDENTIALS=/credentials/sa.json \
  -v /path/to/sa.json:/credentials/sa.json:ro \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_SPECIFIC_IMAGE}:2.0.1

Use Case 2: NIM_MODEL_PATH (Model-Free)#

Scenario: You have your own fine-tuned model already in GCS with normal directory structure. The model was never on NGC. This is model-free mode. The container is generic and serves whatever model you point it to.

This approach has the following characteristics:

  • Upload: Use any GCS tool with no special encoding:

    gsutil cp -r ./my-model/ gs://my-bucket/my-org/my-model/
    
  • GCS key format: Standard directory-style keys, exactly as uploaded:

    my-org/my-model/config.json
    my-org/my-model/model.safetensors
    my-org/my-model/tokenizer.json
    
  • Manifest generation: At startup, NIM LLM dynamically generates a model manifest based on the specified model.

  • When to use: Customer fine-tuned models in GCS, no NGC involvement, models uploaded with standard directory structure.

URI Format: gs://{bucket}/{path}/{model-name}

Example URI: gs://my-bucket/my-org/my-fine-tuned-model
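For reference, the URI splits into a bucket name and an object prefix; the parsing below is illustrative only:

```shell
# Illustrative: decompose a gs:// model URI into bucket and object prefix.
uri="gs://my-bucket/my-org/my-fine-tuned-model"
path="${uri#gs://}"
bucket="${path%%/*}"      # text before the first slash
prefix="${path#*/}"       # everything after the first slash
echo "bucket=$bucket prefix=$prefix"
```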

Upload your model:

gsutil cp -r ./my-model/ gs://my-bucket/my-org/my-model/

Run NIM:

docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -e NIM_MODEL_PATH=gs://my-bucket/my-org/my-model \
  -e GOOGLE_APPLICATION_CREDENTIALS=/credentials/sa.json \
  -v /path/to/sa.json:/credentials/sa.json:ro \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1

For GKE or Vertex AI (using ADC, no credentials file needed):

docker run --gpus=all \
  -v $(pwd)/local_cache:/opt/nim/.cache \
  -e NIM_MODEL_PATH=gs://my-bucket/my-org/my-model \
  -p 8000:8000 \
  ${NIM_LLM_MODEL_FREE_IMAGE}:2.0.1

Comparison#

| Aspect | NIM_REPOSITORY_OVERRIDE | NIM_MODEL_PATH=gs:// |
|---|---|---|
| Purpose | Serve NGC model from GCS | Serve custom model from GCS |
| Model-free mode? | No (model-specific container) | Yes (generic container) |
| Model origin | NGC (has pre-built manifest + checksums) | Customer-uploaded (no manifest) |
| Upload tool | upload-to-gcs.sh or manual percent-encoding | gsutil cp or any tool |
| GCS key format | Percent-encoded flat keys | Normal directory structure |
| Manifest | Built into container at build time | Generated at runtime by nimlib |
| Manifest URI protocol | ngc:// (rewritten to gs:// at runtime) | gsrepo:// |
| Checksums | BLAKE3 from NGC (pre-computed) | BLAKE3 computed during manifest generation |
| Double download? | No (checksums come from the built-in manifest; files download once) | No (computed from cache and served from cache) |
| NIM image type | Pre-built or generic | Generic only |

Important

These two approaches use incompatible key formats. Files uploaded with percent-encoding cannot be consumed by NIM_MODEL_PATH=gs:// (expects directory structure), and vice versa.
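When auditing an existing bucket, a quick heuristic distinguishes the two layouts. This helper is hypothetical (not part of NIM): percent-encoded flat keys contain encoded separators such as %2F or %3A, while directory-style keys keep literal slashes.

```shell
# Hypothetical helper (not part of NIM): classify a GCS object key by layout.
key_layout() {
  case "$1" in
    *%2F*|*%3A*) echo "repository-override" ;;
    *)           echo "model-path" ;;
  esac
}

key_layout "nim%2Fmeta%2Fllama-3.1-8b-instruct%3Ahf%3Ffile%3Dconfig.json"  # repository-override
key_layout "my-org/my-model/config.json"                                   # model-path
```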

Authentication#

GCS supports Application Default Credentials (ADC). Authentication is resolved in the following order:

  1. GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to a service account key file

  2. Vertex AI managed environment credentials

  3. Google Compute Engine (GCE) metadata service

  4. Cloud Run environment

  5. gcloud CLI user credentials

The following environment variables control GCS access:

| Variable | Required | Purpose |
|---|---|---|
| GOOGLE_APPLICATION_CREDENTIALS | Yes (unless using ADC) | Path to service account JSON key file |
| GCS_ENDPOINT_URL | Only for GCS emulators | Custom endpoint (for example, http://localhost:4443 for fake-gcs) |
| STORAGE_EMULATOR_HOST | Only for GCS emulators | Enables anonymous access for GCS emulators |

Troubleshooting#

GKE Issues#

| Issue | Resolution |
|---|---|
| GKE API not enabled | Run gcloud services enable container.googleapis.com and wait 1-2 minutes |
| Pod status: Pending | Verify GPU node pool exists, tolerations match, and PVC is bound |
| Pod status: ImagePullBackOff | Check imagePullSecrets and NGC_API_KEY validity |
| Pod status: CrashLoopBackOff | Check logs with kubectl logs and verify environment variables |
| Startup probe failure | Model loading can take up to 30 minutes for large models; increase failureThreshold if needed |
| No GPU detected | Verify nvidia.com/gpu in node capacity, and ensure that gpu-driver-version=latest was set |
| LoadBalancer stuck pending | Check GCP permissions; assignment can take several minutes |
| Port 8000 blocked | Create a firewall rule to allow TCP port 8000 |

Vertex AI Issues#

| Issue | Resolution |
|---|---|
| Vertex AI API not enabled | Run gcloud services enable aiplatform.googleapis.com and wait 1-2 minutes |
| RuntimeError: The NVIDIA driver on your system is too old | The default Vertex AI GPU driver is too old for NIM LLM. Deploy using gcloud beta ai endpoints deploy-model with --min-gpu-driver-version=580.65.06 or later. Refer to Create Endpoint and Deploy. |
| Deployment stuck | Model loading takes 20 to 45 minutes. Check operation status with gcloud ai operations list --region=$GCP_REGION. |
| 401 error on GCS bucket | Grant the Vertex AI service account access to the bucket |
| Permission denied for custom-online-prediction | Grant the tenant project service account objectViewer access to the GCS bucket |
| Image pull failed | Verify the image is in Artifact Registry and Vertex AI has the artifactregistry.reader role |
| Model URI missing version suffix | Use a generic model-free image with NIM_MODEL_PATH instead of a model-specific image |

For additional troubleshooting, refer to Support and FAQ.

References#