Oracle#

This guide covers deploying NIM LLM on Oracle Cloud Infrastructure (OCI) using OKE (Oracle Kubernetes Engine), a managed Kubernetes service for running containerized workloads on OCI.

OKE Deployment#

Create an OKE cluster with GPU-capable worker nodes and prepare the OCI environment for NIM LLM workloads. This section covers prerequisites, cluster creation, and initial setup.

Prerequisites#

Install the OCI CLI (oci), kubectl, and Helm before proceeding. The commands in this guide use all three.

You also need an NGC API key with access to NIM LLM container images and Helm charts.

Note

Your OCI tenancy must have GPU quota available in the target region. Verify the quota in the OCI Console under Limits, Quotas and Usage. Also ensure that the compartment has a configured VCN with subnets, an internet gateway, a route table, and security lists.

Note

Match the OCI GPU shape to the model you plan to serve. Larger models require more aggregate VRAM and may need multi-GPU shapes. For example, 70B-class models can require up to 8 GPUs, depending on precision and tensor parallelism. Refer to the OCI GPU shape documentation for available options.
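
As a rough illustration of the sizing reasoning above, the following sketch estimates a GPU count from the parameter count, the bytes per parameter, and the per-GPU memory. The estimate_gpus function and its flat 20% overhead factor are assumptions made for this example, not values from NIM LLM or OCI. Treat the result as a lower bound and confirm it against the model's documented requirements.

```shell
# Rough sizing sketch: model weights plus an assumed flat overhead.
# Usage: estimate_gpus <params_in_billions> <bytes_per_param> <gpu_mem_gb>
estimate_gpus() {
  local params_b=$1 bytes_per_param=$2 gpu_mem_gb=$3
  # One billion parameters at one byte each is about 1 GB of weights.
  local weights_gb=$(( params_b * bytes_per_param ))
  # Assumed 20% headroom for KV cache and activations (illustrative figure only).
  local needed_gb=$(( weights_gb * 120 / 100 ))
  # Ceiling division gives the minimum number of GPUs.
  echo $(( (needed_gb + gpu_mem_gb - 1) / gpu_mem_gb ))
}

estimate_gpus 8 2 24    # 8B model, 2 bytes per parameter, 24 GB GPUs: prints 1
estimate_gpus 70 2 80   # 70B model, 2 bytes per parameter, 80 GB GPUs: prints 3
```

Tensor-parallel deployments typically use GPU counts such as 1, 2, 4, or 8, so an estimate of 3 would usually be rounded up to a 4-GPU or 8-GPU shape.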

Create an OKE Cluster through OCI#

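Set the environment variables used throughout this guide. Replace each ${YOUR_...} placeholder with your own value. The cluster OCID is available only after the cluster is created in the steps below: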

export CLUSTER_OCID="${YOUR_CLUSTER_OCID}"
export NAMESPACE="nim-llm"
export NIM_LLM_CHART_VERSION="${YOUR_CHART_VERSION}"
export OCI_REGION="${YOUR_OCI_REGION}"
export RELEASE_NAME="my-nim"

Create an OKE cluster with a GPU-capable node pool using the OCI Console:

  1. Navigate to Developer Services > Kubernetes Clusters (OKE) > Create cluster and select Quick create.

  2. Configure a public API endpoint and choose a GPU shape appropriate for your model.

  3. Under Advanced options, set the boot volume size to at least 500 GB. The OCI default boot volume (approximately 47 GB, of which approximately 35 GB is visible inside the OS) fills up with NIM LLM container images and the model cache during a single deployment.

  4. Submit and wait for the cluster to provision.

Note

For production workloads, use an Enhanced cluster, which includes a financially backed SLA.

After the cluster is created, configure kubectl access:

oci ce cluster create-kubeconfig \
  --cluster-id ${CLUSTER_OCID} \
  --file $HOME/.kube/config \
  --region ${OCI_REGION} \
  --token-version 2.0.0 \
  --kube-endpoint PUBLIC_ENDPOINT

Verify connectivity by running:

kubectl get nodes

Expand the Boot Volume#

OKE does not automatically grow the filesystem inside the OS, even when the boot volume is sized correctly at the node-pool level. Expand the filesystem before deployment. Otherwise, the node hits DiskPressure during deployment and pods are evicted.

First, check the ephemeral storage that the node reports:

kubectl describe nodes | grep ephemeral-storage | head -1

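If ephemeral-storage reads as approximately 35 GB (about 37206272Ki), grow the filesystem in place using a privileged pod (no SSH access required). The following commands target the first node that kubectl returns; repeat them for each additional GPU node by setting NODE accordingly: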

NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')

kubectl run growfs --rm -it --restart=Never --privileged \
  --overrides='{"spec":{"hostPID":true,"nodeName":"'$NODE'","tolerations":[{"operator":"Exists"}]}}' \
  --image=docker.io/library/oraclelinux:8 \
  -- nsenter -t 1 -m -u -i -n /usr/libexec/oci-growfs -y

kubectl run restart-kubelet --rm -it --restart=Never --privileged \
  --overrides='{"spec":{"hostPID":true,"nodeName":"'$NODE'","tolerations":[{"operator":"Exists"}]}}' \
  --image=docker.io/library/oraclelinux:8 \
  -- nsenter -t 1 -m -u -i -n systemctl restart kubelet

Re-run the kubectl describe nodes check to confirm ephemeral-storage now matches the boot volume size.
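
On clusters with more than one node, it can help to see every node's capacity at once. The following sketch lists each node's ephemeral-storage capacity in GiB and flags nodes that fall below a threshold. The report_ephemeral_storage function name and the 100 GiB default threshold are assumptions made for this example, and the function expects capacities reported with a Ki suffix, as in the value shown above.

```shell
# Print each node's ephemeral-storage capacity in GiB and flag small nodes.
# Usage: report_ephemeral_storage [min_gib]   (default threshold is illustrative)
report_ephemeral_storage() {
  local min_gib=${1:-100}
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,EPHEMERAL:.status.capacity.ephemeral-storage' |
  while read -r name capacity; do
    case $capacity in
      *Ki) kib=${capacity%Ki} ;;   # strip the suffix: 37206272Ki -> 37206272
      *)   echo "$name $capacity UNKNOWN_UNIT"; continue ;;
    esac
    gib=$(( kib / 1048576 ))       # 1 GiB = 1048576 KiB
    if [ "$gib" -lt "$min_gib" ]; then
      echo "$name ${gib}GiB NEEDS_GROWFS"
    else
      echo "$name ${gib}GiB OK"
    fi
  done
}

# Example (requires cluster access):
# report_ephemeral_storage 400
```

For each node flagged as NEEDS_GROWFS, set NODE to that node's name and run the growfs and restart-kubelet commands shown earlier in this section.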

Install the NVIDIA GPU Operator#

The NVIDIA GPU Operator manages the GPU device plugin, container toolkit, and runtime class on OKE nodes. OKE GPU node images ship with the NVIDIA driver pre-installed, so the operator is deployed with the bundled driver disabled:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true

OKE GPU node pools sometimes attach an nvidia.com/gpu:NoSchedule taint to the node. The chart tolerates the taint, but clearing it lets non-GPU workloads share the node:

kubectl taint nodes --all nvidia.com/gpu:NoSchedule- 2>/dev/null || true

Verify the operator is running and GPUs are visible to the cluster:

kubectl get pods -n gpu-operator
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.nvidia\.com/gpu}{" GPU(s)\n"}{end}'

If a node advertises fewer GPUs than its shape provides, restart the device plugin:

kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset

For more information, refer to the NVIDIA GPU Operator documentation.
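
The GPU visibility check above can also be scripted. The following sketch compares the GPU count that each node advertises against the count expected for its shape, and returns a non-zero status on a mismatch. The check_gpu_count function name is an assumption made for this example. Nodes that advertise no GPUs at all, such as CPU-only node pools, are skipped.

```shell
# Compare each node's allocatable nvidia.com/gpu count with an expected value.
# Usage: check_gpu_count <expected_gpus_per_node>
check_gpu_count() {
  local expected=$1 mismatches
  mismatches=$(
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' |
      # NF == 2 keeps only nodes that report a GPU count.
      awk -v want="$expected" 'NF == 2 && $2 != want { print $1 ": " $2 " of " want " GPU(s)" }'
  )
  if [ -n "$mismatches" ]; then
    echo "$mismatches"
    return 1
  fi
  echo "all GPU nodes advertise $expected GPU(s)"
}

# Example (requires cluster access):
# check_gpu_count 8 || kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset
```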

Create Kubernetes Secrets#

The Helm chart requires two secrets in the deployment namespace. Missing either one causes the pod to fail with CreateContainerConfigError after scheduling:

  • ngc-secret (type docker-registry) is used by the kubelet to pull images from nvcr.io.

  • ngc-api (type generic) is mounted as the NGC_API_KEY environment variable so the pod can authenticate to NGC for the model download.

export IMAGE_PULL_SECRET="ngc-secret"
export NGC_API_KEY="${YOUR_NGC_API_KEY}"

kubectl create namespace $NAMESPACE

kubectl create secret docker-registry $IMAGE_PULL_SECRET \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"

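Confirm that both secrets exist and that ngc-api is non-empty. The second command prints the length of the stored key in bytes, which should be greater than zero: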

kubectl get secret -n $NAMESPACE
kubectl get secret ngc-api -n $NAMESPACE -o jsonpath='{.data.NGC_API_KEY}' | base64 -d | wc -c
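
Because a missing secret only surfaces as CreateContainerConfigError after the pod is scheduled, a small preflight check before helm install can save a deployment cycle. The following is a minimal sketch; the require_secrets function name is an assumption made for this example.

```shell
# Return non-zero if any of the named secrets is missing from the namespace.
# Usage: require_secrets <namespace> <secret-name>...
require_secrets() {
  local ns=$1 name missing=0
  shift
  for name in "$@"; do
    if ! kubectl get secret "$name" -n "$ns" >/dev/null 2>&1; then
      echo "missing secret: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (requires cluster access):
# require_secrets "$NAMESPACE" "$IMAGE_PULL_SECRET" ngc-api
```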

Create a Persistent Volume Claim#

NIM LLM requires persistent storage to cache the downloaded model across pod restarts. The Helm chart provisions a PVC automatically when you set persistence.enabled=true on helm install (covered in the next section). The PVC uses the cluster’s default storage class, which on OKE is oci-bv, backed by OCI Block Volume.

For faster cold starts on large models, optionally create an Ultra High Performance Block Volume storage class and reference it with persistence.storageClass:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: oci-bv-uhp
provisioner: blockvolume.csi.oraclecloud.com
parameters:
  vpusPerGB: "30"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
EOF

Deploy NIM LLM with Helm#

  1. Fetch the Helm chart from NGC:

    helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-${NIM_LLM_CHART_VERSION}.tgz
    
  2. Optional: View the default chart values to understand available configuration options.

    helm show values nim-llm-${NIM_LLM_CHART_VERSION}.tgz
    

    Tip

    For help choosing the right model configuration, refer to Model Profiles and Selection.

  3. Install the chart with inline --set flags. The following example uses nvcr.io/nim/meta/llama-3.1-8b-instruct on a single GPU. Adjust the image, GPU count, and persistence.size to match your model.

    helm install $RELEASE_NAME nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
      --namespace $NAMESPACE \
      --set image.repository=nvcr.io/nim/meta/llama-3.1-8b-instruct \
      --set image.tag="$NIM_LLM_CHART_VERSION" \
      --set 'resources.limits.nvidia\.com/gpu=1' \
      --set persistence.enabled=true \
      --set persistence.size=80Gi \
      --set-string 'env[0].name=NIM_MAX_MODEL_LEN' --set-string 'env[0].value=32768'
    

    Note

    The chart’s default persistence.size of 50Gi does not leave enough headroom for the model weights and the container image cache. Use 80Gi as a safe minimum for model weights of approximately 10 GB to 16 GB, and increase the size proportionally for larger models. NIM_MAX_MODEL_LEN caps the KV-cache allocation so that it fits in the available VRAM. Adjust it to match your shape’s GPU memory.

    Note

    If you created the optional oci-bv-uhp storage class in the previous section, add --set persistence.storageClass=oci-bv-uhp to the command above to provision the PVC on Ultra High Performance block storage. Omit this flag to use the cluster default (oci-bv).
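
The inline --set flags in step 3 can also be kept in a values file, which is easier to review and reuse. The following sketch writes a file that mirrors those flags one-to-one. The write_nim_values function name and the nim-values.yaml file name are assumptions made for this example; the keys are the same ones passed with --set above.

```shell
# Write a Helm values file equivalent to the --set flags used in step 3.
# Usage: write_nim_values <output-file>
write_nim_values() {
  cat > "$1" <<EOF
image:
  repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
  tag: "${NIM_LLM_CHART_VERSION}"
resources:
  limits:
    nvidia.com/gpu: 1
persistence:
  enabled: true
  size: 80Gi
env:
  - name: NIM_MAX_MODEL_LEN
    value: "32768"
EOF
}

# Example (requires the chart archive and cluster access):
# write_nim_values nim-values.yaml
# helm install "$RELEASE_NAME" "nim-llm-${NIM_LLM_CHART_VERSION}.tgz" \
#   --namespace "$NAMESPACE" -f nim-values.yaml
```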

Verify the Deployment#

  1. Watch the pod come up:

    kubectl get pods -n $NAMESPACE -w
    

    First-time startup includes the image pull, the model download from NGC, and engine initialization. Expect 5 to 10 minutes on the oci-bv-uhp storage class and 15 to 25 minutes on the default oci-bv storage class.

  2. Confirm the chart and NIM LLM versions, and verify that the GPU is bound to the pod:

    helm list -n $NAMESPACE
    export POD=$(kubectl -n $NAMESPACE get pod -l app.kubernetes.io/instance=$RELEASE_NAME -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -n $NAMESPACE $POD -- nvidia-smi -L
    kubectl logs -n $NAMESPACE $POD | grep 'NIM Version'
    
  3. List the served model and send a test inference request from inside the pod:

    kubectl exec -n $NAMESPACE $POD -- curl -sS http://localhost:8000/v1/models | python3 -m json.tool
    
    kubectl exec -n $NAMESPACE $POD -- curl -sS http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"<MODEL_NAME>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
    

    Replace <MODEL_NAME> with the value advertised by /v1/models.

  4. To call the service from outside the cluster, use kubectl port-forward:

    kubectl port-forward -n $NAMESPACE svc/${RELEASE_NAME}-nim-llm 8000:8000
    # In another terminal:
    curl http://localhost:8000/v1/health/ready
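
Instead of copying <MODEL_NAME> by hand in step 3, the name can be read from the /v1/models response. The following sketch assumes that the response follows the OpenAI-style list format, with a top-level data array whose entries carry an id field. The first_model_id function name is an assumption made for this example, and python3 runs on your workstation, as in step 3.

```shell
# Read an OpenAI-style model list on stdin and print the id of the first entry.
first_model_id() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["data"][0]["id"])'
}

# Example (requires cluster access):
# MODEL_NAME=$(kubectl exec -n "$NAMESPACE" "$POD" -- \
#   curl -sS http://localhost:8000/v1/models | first_model_id)
# kubectl exec -n "$NAMESPACE" "$POD" -- curl -sS http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "{\"model\":\"$MODEL_NAME\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],\"max_tokens\":128}"
```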
    

Teardown#

To remove all resources created by this guide:

# Remove the Helm release
helm uninstall $RELEASE_NAME --namespace $NAMESPACE

# Delete Kubernetes secrets
kubectl delete secret ngc-api $IMAGE_PULL_SECRET --namespace $NAMESPACE

# Delete the namespace (also deletes the chart-managed PVC)
kubectl delete namespace $NAMESPACE

# Delete the OKE cluster (also removes managed node pools)
oci ce cluster delete --cluster-id ${CLUSTER_OCID} --region ${OCI_REGION} --force

Note

Deleting the cluster does not automatically remove associated OCI Block Volumes or Load Balancers. Delete these manually from the OCI Console to avoid ongoing charges.
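
As an alternative to the Console, leftover resources can be listed with the OCI CLI. The following is a minimal sketch: the list_leftovers function name and the COMPARTMENT_OCID variable are assumptions made for this example, and the output covers every available block volume and load balancer in the compartment, not only those created by this guide. Review the list before deleting anything.

```shell
# List available block volumes and load balancers in a compartment.
# Usage: list_leftovers <compartment-ocid>
list_leftovers() {
  local compartment=$1
  # Block volumes that are no longer attached show up as AVAILABLE.
  oci bv volume list --compartment-id "$compartment" --lifecycle-state AVAILABLE \
    --query 'data[].{name:"display-name",id:id}' --output table
  # Load balancers created for Kubernetes Services of type LoadBalancer.
  oci lb load-balancer list --compartment-id "$compartment" \
    --query 'data[].{name:"display-name",id:id}' --output table
}

# Example (requires OCI CLI credentials; COMPARTMENT_OCID is a hypothetical variable):
# list_leftovers "$COMPARTMENT_OCID"
```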