
Oracle#

This guide covers deploying NIM LLM on Oracle Cloud Infrastructure (OCI) using OKE (Oracle Kubernetes Engine), a managed Kubernetes service for running containerized workloads on OCI.

OKE Deployment#

Create an OKE cluster with GPU-capable worker nodes and prepare the OCI environment for NIM LLM workloads. This section covers prerequisites, cluster creation, and initial setup.

Prerequisites#

Install the following tools before proceeding: the OCI CLI (oci), kubectl, and Helm. The commands in this guide use all three.

You also need an NGC API key with access to NIM LLM container images and Helm charts.

Note

Your OCI tenancy must have GPU quota available in the target region. Verify quota in the OCI Console under Limits, Quotas and Usage. Also ensure the compartment has a configured VCN, subnets, internet gateway, route table, and security lists.

Note

Match the OCI GPU shape to the model you plan to serve. Larger models require more aggregate VRAM and may need multi-GPU shapes — 70B-class models can require up to 8 GPUs depending on precision and tensor parallelism. Refer to the OCI GPU shape documentation for available options.
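As a rough illustration of the sizing arithmetic behind this note, the sketch below estimates the memory needed for the weights alone as parameter count times bytes per parameter, then divides by per-GPU memory. The helper names and the 80 GB per-GPU figure are assumptions chosen for the example, not values taken from OCI or NIM documentation. Treat the result as a lower bound: KV cache, activations, and runtime overhead add to it, and tensor-parallel sizes are usually powers of two, so real deployments often need more GPUs than this minimum.

```shell
# Weight memory in GB: parameters (in billions) x bytes per parameter.
# Hypothetical helper for illustration only.
weights_gb() {
  echo $(( $1 * $2 ))
}

# Minimum GPU count that can hold the weights alone:
# ceil(weights_gb / per_gpu_gb) using integer arithmetic.
min_gpus_for_weights() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

weights_gb 70 2               # 70B parameters at 2 bytes each (FP16/BF16) -> 140
min_gpus_for_weights 140 80   # assumed 80 GB per GPU -> 2
```

Halving the bytes per parameter (for example, 8-bit quantization) halves the weight footprint, which is why precision changes the shape you need.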

Important

OKE’s default and recommended CNI is OCI VCN-Native Pod Networking, which assigns each pod its own routable VCN VNIC. Pod VNICs do not receive public IPv4 addresses, so a default Internet Gateway route cannot NAT outbound pod traffic, and pulls from Hugging Face, NGC, or any other public registry from inside the pod will time out.

If you use Quick Create, you must add a NAT Gateway and a dedicated private pod subnet routed through it before deploying NIM LLM. Refer to Networking Prerequisites below.

Networking Prerequisites#

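Skip this section if you are bringing your own VCN with a NAT Gateway and a dedicated pod subnet already configured. Otherwise, apply it once per Quick Create cluster, after the cluster exists: the commands need the OCIDs of the VCN and node pool that Quick Create provisions.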

export COMPARTMENT_OCID="${YOUR_COMPARTMENT_OCID}"
export VCN_OCID="${YOUR_VCN_OCID}"
export NODE_POOL_OCID="${YOUR_NODE_POOL_OCID}"

# Create a NAT Gateway in the VCN
NAT_OCID=$(oci network nat-gateway create \
  --compartment-id "$COMPARTMENT_OCID" \
  --vcn-id "$VCN_OCID" \
  --display-name "oke-nat-gw" \
  --wait-for-state AVAILABLE \
  --query 'data.id' --raw-output)

# Route table whose default route is the NAT Gateway
POD_RT=$(oci network route-table create \
  --compartment-id "$COMPARTMENT_OCID" \
  --vcn-id "$VCN_OCID" \
  --display-name "oke-podsubnet-nat-rt" \
  --route-rules '[{"destination":"0.0.0.0/0","destinationType":"CIDR_BLOCK","networkEntityId":"'"$NAT_OCID"'"}]' \
  --wait-for-state AVAILABLE --query 'data.id' --raw-output)

# Dedicated private pod subnet (no public IPs on VNICs)
DEFAULT_SL=$(oci network vcn get --vcn-id "$VCN_OCID" \
  --query 'data."default-security-list-id"' --raw-output)
POD_SUBNET=$(oci network subnet create \
  --compartment-id "$COMPARTMENT_OCID" \
  --vcn-id "$VCN_OCID" \
  --cidr-block "10.0.30.0/24" \
  --display-name "oke-podsubnet-nat" \
  --route-table-id "$POD_RT" \
  --security-list-ids '["'"$DEFAULT_SL"'"]' \
  --prohibit-public-ip-on-vnic true \
  --wait-for-state AVAILABLE --query 'data.id' --raw-output)

# Point the node pool at the new pod subnet
oci ce node-pool update --node-pool-id "$NODE_POOL_OCID" \
  --pod-subnet-ids '["'"$POD_SUBNET"'"]' \
  --force --wait-for-state SUCCEEDED
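Each command above interpolates an OCID into its arguments or into inline JSON, so an unset variable produces an empty argument or malformed JSON rather than a clear error. A small guard run before the commands can catch that early. This is a minimal sketch; require_vars is a hypothetical helper name, not part of the OCI CLI.

```shell
# Hypothetical guard: fail if any named variable is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is in $name (POSIX sh).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage before the networking commands:
# require_vars COMPARTMENT_OCID VCN_OCID NODE_POOL_OCID || exit 1
```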

Important

The node pool update applies only to new nodes. If your GPU node was provisioned before the change, terminate it (or scale the node pool down and back up) so OKE replaces it with a node whose pod VNICs come from the NAT-routed subnet. Existing nodes continue to use the old pod subnet until they are recycled.

After the new node is Ready, confirm pods scheduled on it land on the new subnet:

kubectl get pods -A -o wide
# Pod IPs should be in 10.0.30.0/24 (or whatever CIDR you chose for the pod subnet).
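Checking pod IPs against the subnet by eye is error-prone on a busy cluster. The sketch below tests whether an IPv4 address falls inside a CIDR block using integer arithmetic. The function names are hypothetical, and the commented usage assumes the 10.0.30.0/24 pod subnet from this section. Pods that use host networking report the node's IP, so they are expected to fall outside the pod subnet.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  a=${1%%.*}; rest=${1#*.}
  b=${rest%%.*}; rest=${rest#*.}
  c=${rest%%.*}; d=${rest#*.}
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Succeed if address $1 lies inside CIDR block $2 (for example 10.0.30.0/24).
ip_in_cidr() {
  net=${2%/*}
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  ip_masked=$(( $(ip_to_int "$1") & mask ))
  net_masked=$(( $(ip_to_int "$net") & mask ))
  [ "$ip_masked" -eq "$net_masked" ]
}

# Usage against a live cluster (requires kubectl access):
# kubectl get pods -A -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' |
#   while read -r ip; do
#     [ -n "$ip" ] || continue
#     ip_in_cidr "$ip" 10.0.30.0/24 || echo "outside pod subnet: $ip"
#   done
```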

Create an OKE Cluster through OCI#

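Set the environment variables used throughout this guide. CLUSTER_OCID is known only after the cluster is created, so export it once provisioning completes: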

export CLUSTER_OCID="${YOUR_CLUSTER_OCID}"
export NAMESPACE="nim-llm"
export NIM_LLM_CHART_VERSION="${YOUR_CHART_VERSION}"
export OCI_REGION="${YOUR_OCI_REGION}"
export RELEASE_NAME="my-nim"

Create an OKE cluster with a GPU-capable node pool using the OCI Console:

Navigate to Developer Services > Kubernetes Clusters (OKE) > Create cluster and select Quick create.
Configure a public API endpoint and choose a GPU shape appropriate for your model.
Under Advanced options, set the boot volume size to at least 500 GB. The OCI default (approximately 47 GB, presenting as approximately 35 GB inside the OS) is exhausted by NIM LLM container images and the model cache during a single deployment.
Submit and wait for the cluster to provision.

Note

For production workloads, use an Enhanced cluster, which includes a financially backed SLA.

After the cluster is created, configure kubectl access:

oci ce cluster create-kubeconfig \
  --cluster-id "${CLUSTER_OCID}" \
  --file "$HOME/.kube/config" \
  --region "${OCI_REGION}" \
  --token-version 2.0.0 \
  --kube-endpoint PUBLIC_ENDPOINT

Verify connectivity by running:

kubectl get nodes

Expand the Boot Volume#

OKE does not automatically grow the filesystem inside the OS, even when the boot volume itself is sized correctly at the node-pool level. Expand the filesystem before deploying. Without this step, the node hits DiskPressure during deployment and pods are evicted.

Check first:

kubectl describe nodes | grep ephemeral-storage | head -1

If ephemeral-storage reads as approximately 35 GB (about 37206272Ki), grow the filesystem in place using a privileged pod (no SSH access required):

NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')

kubectl run growfs --rm -it --restart=Never --privileged \
  --overrides='{"spec":{"hostPID":true,"nodeName":"'$NODE'","tolerations":[{"operator":"Exists"}]}}' \
  --image=docker.io/library/oraclelinux:8 \
  -- nsenter -t 1 -m -u -i -n /usr/libexec/oci-growfs -y

kubectl run restart-kubelet --rm -it --restart=Never --privileged \
  --overrides='{"spec":{"hostPID":true,"nodeName":"'$NODE'","tolerations":[{"operator":"Exists"}]}}' \
  --image=docker.io/library/oraclelinux:8 \
  -- nsenter -t 1 -m -u -i -n systemctl restart kubelet

Re-run the kubectl describe nodes check to confirm ephemeral-storage now matches the boot volume size.
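The before-and-after comparison can be scripted by converting the reported quantity to GiB. The sketch below assumes the value carries a Ki suffix, as in the 37206272Ki example above. The function names and the "less than half of the configured size" threshold are illustrative assumptions, not OKE behavior.

```shell
# Convert a Kubernetes quantity such as 37206272Ki to whole GiB (rounded down).
ki_to_gib() {
  ki=${1%Ki}
  echo $(( ki / 1024 / 1024 ))
}

# Succeed if the filesystem looks unexpanded: reported GiB is below half of
# the boot volume size (in GB) configured on the node pool.
# The one-half threshold is an arbitrary choice for this sketch.
needs_growfs() {
  [ "$(ki_to_gib "$1")" -lt $(( $2 / 2 )) ]
}

ki_to_gib 37206272Ki   # -> 35

# Usage against a live cluster (requires kubectl access):
# CAP=$(kubectl get nodes -o jsonpath='{.items[0].status.capacity.ephemeral-storage}')
# needs_growfs "$CAP" 500 && echo "filesystem not expanded yet"
```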

Install the NVIDIA GPU Operator#

The NVIDIA GPU Operator manages the GPU device plugin, container toolkit, and runtime class on OKE nodes. OKE GPU node images ship with the NVIDIA driver pre-installed, so the operator is deployed with the bundled driver disabled:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true

OKE GPU node pools occasionally attach an nvidia.com/gpu:NoSchedule taint to the node. The chart tolerates the taint, but clearing it lets non-GPU workloads share the node:

kubectl taint nodes --all nvidia.com/gpu:NoSchedule- 2>/dev/null || true

Verify the operator is running and GPUs are visible to the cluster:

kubectl get pods -n gpu-operator
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.nvidia\.com/gpu}{" GPU(s)\n"}{end}'

If a node advertises fewer GPUs than its shape provides, restart the device plugin:

kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset
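To spot under-reporting nodes without reading the list by hand, the output of the jsonpath command above can be filtered against the GPU count you expect from the shape. This is a minimal sketch: gpu_shortfall is a hypothetical helper, and it assumes input lines in the "<node>: <count> GPU(s)" form that the verification command prints.

```shell
# Read "<node>: <count> GPU(s)" lines on stdin and print every node whose
# allocatable GPU count is below the expected count given as $1.
# A node with no allocatable GPUs prints an empty count, which is treated as 0.
gpu_shortfall() {
  awk -v want="$1" -F': ' '{
    split($2, f, " ")
    have = f[1] + 0
    if (have < want) print $1 " has " have " of " want " GPU(s)"
  }'
}

# Usage against a live cluster (requires kubectl access), expecting 8 GPUs per node:
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.nvidia\.com/gpu}{" GPU(s)\n"}{end}' |
#   gpu_shortfall 8
```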

For more information, refer to the NVIDIA GPU Operator documentation.

Create Kubernetes Secrets#

The Helm chart requires two secrets in the deployment namespace. Missing either one causes the pod to fail with CreateContainerConfigError after scheduling:

ngc-secret (type docker-registry) is used by the kubelet to pull images from nvcr.io.
ngc-api (type generic) is mounted as the NGC_API_KEY environment variable so the pod can authenticate to NGC for the model download.

export IMAGE_PULL_SECRET="ngc-secret"
export NGC_API_KEY="${YOUR_NGC_API_KEY}"

kubectl create namespace $NAMESPACE

kubectl create secret docker-registry $IMAGE_PULL_SECRET \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"

Confirm both secrets exist and ngc-api is non-empty:

kubectl get secret -n $NAMESPACE
kubectl get secret ngc-api -n $NAMESPACE -o jsonpath='{.data.NGC_API_KEY}' | base64 -d | wc -c
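The byte count confirms the secret is non-empty but not that the key is clean. A key pasted with a trailing newline or embedded space is stored verbatim by --from-literal and can cause authentication to fail later. The sketch below is a pre-flight check to run before creating the secrets; valid_key is a hypothetical helper and checks only for emptiness and whitespace, not whether NGC accepts the key.

```shell
# Succeed only if the argument is non-empty and contains no whitespace
# (spaces, tabs, or newlines).
valid_key() {
  case $1 in
    ''|*[[:space:]]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Warn instead of creating secrets from a malformed key.
valid_key "${NGC_API_KEY:-}" ||
  echo "NGC_API_KEY is empty or contains whitespace" >&2
```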

Create a Persistent Volume Claim#

NIM LLM requires persistent storage to cache the downloaded model across pod restarts. The Helm chart provisions a PVC automatically when you set persistence.enabled=true on helm install (covered in the next section). The PVC uses the cluster’s default storage class, which is oci-bv on OKE for OCI Block Volume.

For faster cold starts on large models, optionally create an Ultra High Performance Block Volume storage class and reference it with persistence.storageClass:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: oci-bv-uhp
provisioner: blockvolume.csi.oraclecloud.com
parameters:
  vpusPerGB: "30"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
EOF

Deploy NIM LLM with Helm#

Fetch the Helm chart from NGC:

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-${NIM_LLM_CHART_VERSION}.tgz

Optional: View the default chart values to understand available configuration options.
```
helm show values nim-llm-${NIM_LLM_CHART_VERSION}.tgz
```
Tip

For help choosing the right model configuration, refer to Model Profiles and Selection.

Install the chart with inline --set flags. The following example uses nvcr.io/nim/meta/llama-3.1-8b-instruct on a single GPU. Adjust the image, GPU count, and persistence.size to match your model.
```
helm install $RELEASE_NAME nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
  --namespace $NAMESPACE \
  --set image.repository=nvcr.io/nim/meta/llama-3.1-8b-instruct \
  --set image.tag="$NIM_LLM_CHART_VERSION" \
  --set 'resources.limits.nvidia\.com/gpu=1' \
  --set persistence.enabled=true \
  --set persistence.size=80Gi \
  --set-string 'env[0].name=NIM_MAX_MODEL_LEN' --set-string 'env[0].value=32768'
```
Note

The chart’s default persistence.size of 50Gi does not provide enough headroom for the model weights and the container image cache. Use 80Gi as a safe minimum for approximately 10 GB to 16 GB model weights, and increase the size proportionally for larger models. NIM_MAX_MODEL_LEN caps the maximum context length, which bounds the KV-cache allocation so that it fits in the available VRAM. Adjust it to match your shape’s GPU memory.

Note

If you created the optional oci-bv-uhp storage class in the previous section, add --set persistence.storageClass=oci-bv-uhp to the command above to provision the PVC on Ultra High Performance block storage. Omit this flag to use the cluster default (oci-bv).
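As the number of overrides grows, the same settings can be kept in a values file and passed to Helm with -f. The sketch below writes a file that mirrors the --set flags in the install command above key for key. The file name nim-values.yaml is an arbitrary choice, and the fallback text for an unset NIM_LLM_CHART_VERSION is only a placeholder.

```shell
# Write a values file equivalent to the inline --set flags.
# The env value is quoted so it stays a string, matching --set-string.
cat > nim-values.yaml <<EOF
image:
  repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
  tag: "${NIM_LLM_CHART_VERSION:-YOUR_CHART_VERSION}"
resources:
  limits:
    nvidia.com/gpu: 1
persistence:
  enabled: true
  size: 80Gi
env:
  - name: NIM_MAX_MODEL_LEN
    value: "32768"
EOF

# Install with the file instead of the inline flags:
# helm install "$RELEASE_NAME" "nim-llm-${NIM_LLM_CHART_VERSION}.tgz" \
#   --namespace "$NAMESPACE" -f nim-values.yaml
```

To use the optional Ultra High Performance storage class, add storageClass: oci-bv-uhp under persistence in the file.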

Verify the Deployment#

Watch the pod come up:
```
kubectl get pods -n $NAMESPACE -w
```
First-time startup includes image pull, model download from NGC, and engine initialization. Expect 5 to 10 minutes on the oci-bv-uhp storage class and 15 to 25 minutes on the default oci-bv.

Confirm the chart and NIM LLM versions, and that the GPU is bound to the pod:

helm list -n $NAMESPACE
export POD=$(kubectl -n $NAMESPACE get pod -l app.kubernetes.io/instance=$RELEASE_NAME -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n $NAMESPACE $POD -- nvidia-smi -L
kubectl logs -n $NAMESPACE $POD | grep 'NIM Version'

List the served model and send a test inference request from inside the pod:

kubectl exec -n $NAMESPACE $POD -- curl -sS http://localhost:8000/v1/models | python3 -m json.tool

kubectl exec -n $NAMESPACE $POD -- curl -sS http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<MODEL_NAME>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'

Replace <MODEL_NAME> with the value advertised by /v1/models.
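Rather than copying the name by hand, the first model id can be pulled out of the /v1/models response. This is a dependency-free sketch that assumes an OpenAI-style response in which the first "id" field is the model name; first_model_id is a hypothetical helper, and a JSON-aware tool such as jq would be more robust where it is available.

```shell
# Print the value of the first "id" field found in JSON read from stdin.
first_model_id() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' |
    head -n 1 |
    sed 's/.*"\([^"]*\)"$/\1/'
}

# Usage against the running pod (requires kubectl access):
# MODEL_NAME=$(kubectl exec -n "$NAMESPACE" "$POD" -- \
#   curl -sS http://localhost:8000/v1/models | first_model_id)
# echo "$MODEL_NAME"
```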

To call the service from outside the cluster, use kubectl port-forward:

kubectl port-forward -n $NAMESPACE svc/$RELEASE_NAME-nim-llm 8000:8000
# In another terminal:
curl http://localhost:8000/v1/health/ready
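Because first-time startup can take many minutes, a single health check often runs before the service is ready. The sketch below retries a command until it succeeds or a timeout elapses. wait_until is a hypothetical helper, and the timeout and interval in the usage line are illustrative values to adjust for your storage class.

```shell
# Retry a command until it succeeds or the timeout (seconds) is exceeded.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    elapsed=$(( elapsed + interval_s ))
    if [ "$elapsed" -gt "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
  return 0
}

# With the port-forward running in another terminal:
# wait_until 1800 15 curl -fsS http://localhost:8000/v1/health/ready &&
#   echo "NIM is ready"
```

The -f flag makes curl return a non-zero exit code on HTTP errors, so the loop keeps polling until the readiness endpoint returns a success status.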

Note

The chart’s built-in helm test "$RELEASE_NAME" validation pulls curlimages/curl:8.6.0 from Docker Hub by short name. This works on OKE node images whose container runtime resolves unqualified references against Docker Hub by default, including standard OKE worker nodes (containerd) and OKE GPU nodes that ship with permissive CRI-O short-name resolution. OKE Gen2 GPU node images run CRI-O 1.34 with strict short-name resolution and require fully qualified image references, so the test pods cannot pull the image there. On those node pools, validate inference with the kubectl exec and kubectl port-forward commands above instead. They cover the same /v1/health/ready, /v1/models, and /v1/chat/completions endpoints that the chart’s test pods exercise.

Teardown#

To remove all resources created by this guide:

# Remove the Helm release
helm uninstall $RELEASE_NAME --namespace $NAMESPACE

# Delete Kubernetes secrets
kubectl delete secret ngc-api $IMAGE_PULL_SECRET --namespace $NAMESPACE

# Delete the namespace (also deletes the chart-managed PVC)
kubectl delete namespace $NAMESPACE

# Delete the OKE cluster (also removes managed node pools)
oci ce cluster delete --cluster-id ${CLUSTER_OCID} --region ${OCI_REGION} --force

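If you created a dedicated NAT-routed pod subnet for this deployment (refer to Networking Prerequisites), remove the subnet, route table, and NAT Gateway after the cluster is deleted. The commands reuse the POD_SUBNET, POD_RT, and NAT_OCID variables set in that section, so set them again if you are working in a new shell: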

oci network subnet delete --subnet-id $POD_SUBNET --force --wait-for-state TERMINATED
oci network route-table delete --rt-id $POD_RT --force --wait-for-state TERMINATED
oci network nat-gateway delete --nat-gateway-id $NAT_OCID --force --wait-for-state TERMINATED

Note

Deleting the cluster does not automatically remove associated OCI Block Volumes, Load Balancers, NAT Gateways, or custom subnets. Delete these manually from the OCI Console to avoid ongoing charges.