Install NeMo Platform Helm Chart

Tip

This setup is the full enterprise platform, intended for advanced use. If you are just getting started, see Quickstart instead; it is a faster and easier way to explore the basics.

To deploy the NeMo Platform, follow these steps after completing the Prerequisites.

  1. Add the NeMo Platform Helm Chart to your local Helm repositories.

    helm repo add nmp https://helm.ngc.nvidia.com/nvidia/nemo-microservices \
       --username='$oauthtoken' \
       --password="$NGC_API_KEY"
    
    helm repo update
    
  2. Review the default values in the NeMo Platform Helm Chart reference. To override the default values, create a custom values file. Review the following while creating your custom values file.

  3. Install the Volcano scheduler before installing the chart. Volcano is required for customization jobs that run across multiple nodes.

    kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.9.0/installer/volcano-development.yaml
    

    After applying, wait for the Volcano admission webhook to finish initializing before proceeding. The webhook registers immediately with failurePolicy: Fail, but TLS certificate generation runs asynchronously. If you proceed before the webhook is ready, all pod creation — including the NeMo Platform Helm install — will fail with a certificate error.

    kubectl wait --for=condition=complete job/volcano-admission-init -n volcano-system --timeout=120s
    kubectl rollout status deployment/volcano-admission -n volcano-system
    
  4. Install the chart using your values.yaml file:

    helm install nemo-platform nmp/nemo-platform -f values.yaml
    

    Installation takes approximately 10 minutes, which covers image downloads, container startup, and establishing communication. The actual time depends on the speed of your network connection. Pods might appear in Pending or restarting states during installation.

  5. Verify the pod status:

    kubectl get pods
    

For Red Hat OpenShift, use OpenShift-compatible security context overrides so pods satisfy the restricted SCC. See OpenShift.
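
The status check in step 5 can be scripted for unattended installs. The following is a minimal sketch, not part of the NeMo Platform tooling: the function name `wait_for_pods`, its arguments, and the default attempt count and delay are assumptions chosen for illustration. It polls until no pod in the namespace is outside the Running or Succeeded phases.

```shell
# Hypothetical helper (not part of the chart): poll until every pod in a
# namespace is in the Running or Succeeded phase.
# Usage: wait_for_pods [namespace] [max_attempts] [delay_seconds]
wait_for_pods() {
  namespace="${1:-default}"
  max_attempts="${2:-60}"
  delay="${3:-10}"
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # List only pods whose phase is neither Running nor Succeeded.
    if out="$(kubectl get pods -n "$namespace" --no-headers \
        --field-selector='status.phase!=Running,status.phase!=Succeeded' 2>/dev/null)"; then
      # Count non-empty lines; grep exits non-zero on a count of 0, hence "|| true".
      remaining="$(printf '%s\n' "$out" | grep -c . || true)"
      if [ "$remaining" -eq 0 ]; then
        echo "All pods in namespace '$namespace' are Running or Succeeded."
        return 0
      fi
      echo "$remaining pod(s) still starting in '$namespace'."
    else
      echo "kubectl could not list pods; retrying."
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "Timed out waiting for pods in '$namespace'." >&2
  return 1
}
```

For example, `wait_for_pods default 90 10` allows roughly 15 minutes. A pod in the Running phase has not necessarily passed its readiness probes, and an empty namespace counts as success, so treat a zero exit status as a first check rather than proof of a healthy deployment.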

To upgrade the deployment with new configurations, use the following command:

helm upgrade nemo-platform nmp/nemo-platform -f values.yaml
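
Before applying a new configuration, it can be useful to record the values the running release was deployed with. The sketch below is one way to do that, under stated assumptions: the function name `upgrade_with_snapshot`, the snapshot file naming, and the 15-minute timeout are illustrative choices, not chart requirements. It relies only on the standard `helm get values` and `helm upgrade` commands.

```shell
# Hypothetical helper: save the release's current user-supplied values,
# then run the upgrade.
# Usage: upgrade_with_snapshot [release] [chart] [values_file]
upgrade_with_snapshot() {
  release="${1:-nemo-platform}"
  chart="${2:-nmp/nemo-platform}"
  values_file="${3:-values.yaml}"
  snapshot="${release}-values-$(date +%Y%m%d-%H%M%S).yaml"

  # "helm get values" prints the values supplied to the deployed release.
  if ! helm get values "$release" -o yaml > "$snapshot"; then
    echo "Could not read current values for release '$release'." >&2
    rm -f "$snapshot"
    return 1
  fi
  echo "Saved current values to $snapshot"

  # Apply the new configuration; --wait blocks until resources are ready
  # or the timeout (an assumed 15 minutes) expires.
  helm upgrade "$release" "$chart" -f "$values_file" --wait --timeout 15m
}
```

If the upgrade misbehaves, the snapshot file can be passed back with `-f` to restore the previous configuration.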

To uninstall the deployment, use the following command:

helm uninstall nemo-platform

Note

helm uninstall intentionally does not remove all resources:

  • PVCs are preserved to prevent accidental data loss. Delete them manually if no longer needed.

  • CRDs are not removed by Helm design (upstream issue) to avoid destroying custom resources across the cluster.

  • Completed jobs, secrets, and the namespace may also remain.

If you need a complete teardown (e.g., for CI/CD pipelines or reinstalling in the same namespace), run the following after helm uninstall:

# Delete the namespace — removes all namespace-scoped resources (pods, PVCs, jobs, secrets, etc.)
kubectl delete namespace <namespace>

# Then delete CRDs as needed, though this is generally not required
kubectl delete crd <crd_name>

Warning

Deleting CRDs removes all custom resources of those types cluster-wide. Only do this if no other workloads depend on them.
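
Before running the teardown commands, it may help to see what `helm uninstall` actually left behind. The following sketch lists the resource types named in the note above; the function name `list_leftovers` is hypothetical, and the output will include resources owned by other workloads in the same namespace or cluster.

```shell
# Hypothetical helper: list what remains after "helm uninstall".
# Usage: list_leftovers <namespace>
list_leftovers() {
  namespace="${1:-}"
  if [ -z "$namespace" ]; then
    echo "usage: list_leftovers <namespace>" >&2
    return 2
  fi
  echo "Namespace-scoped resources remaining in '$namespace':"
  kubectl get pvc,jobs,secrets -n "$namespace"
  echo "Cluster-scoped CRDs (shared by every namespace):"
  kubectl get crd
}
```

Reviewing this output first makes it easier to confirm that no other workload depends on a CRD before deleting it.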

Troubleshooting

Volcano admission webhook blocks pod creation

Symptom: Pod creation fails cluster-wide with an error like:

Internal error occurred: failed calling webhook "mutatepod.volcano.sh":
failed to call webhook: Post "https://volcano-admission-service.volcano-system.svc:443/pods/mutate?timeout=10s":
tls: failed to verify certificate: x509: certificate signed by unknown authority

Cause: The Volcano MutatingWebhookConfiguration registers with failurePolicy: Fail before the volcano-admission-init job finishes generating TLS certificates. This affects all namespaces, not just Volcano workloads.

Fix: Wait for the webhook to become functional, then restart the admission deployment to force certificate regeneration:

kubectl rollout restart deployment/volcano-admission -n volcano-system
kubectl rollout status deployment/volcano-admission -n volcano-system

Verify the webhook is accepting requests before retrying your Helm install:

until kubectl run volcano-webhook-test --image=busybox --restart=Never --dry-run=server -o yaml 2>/dev/null; do
  echo "Volcano webhook not ready yet, waiting..."
  sleep 5
done
kubectl delete pod volcano-webhook-test --ignore-not-found=true
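
If the webhook still rejects requests, gathering the state of the admission components in one pass can narrow down the cause. This is a sketch only: the function name `volcano_webhook_diagnostics` is hypothetical, and the resource names are the ones used earlier on this page (`volcano-admission-init`, `volcano-admission`, `volcano-system`).

```shell
# Hypothetical helper: collect the state of the Volcano admission components.
volcano_webhook_diagnostics() {
  ns="volcano-system"
  # Did the certificate-generation job finish?
  kubectl get job volcano-admission-init -n "$ns"
  kubectl logs job/volcano-admission-init -n "$ns" --tail=20
  # Is the admission deployment available?
  kubectl get deployment volcano-admission -n "$ns"
  # Which webhook configurations are registered cluster-wide?
  kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
}
```

A job that has not completed, or a deployment with no available replicas, points back to the cause described above.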