Install NeMo Platform Helm Chart
Tip
This setup is the full enterprise platform, intended for advanced use. If you are just getting started, use the Quickstart instead; it is a faster and easier way to explore the basics.
To deploy the NeMo Platform, follow these steps after completing the Prerequisites.
Download the NeMo Platform Helm Chart tarball from the NGC private registry.
ngc registry chart pull --org nvidian '0857255566152269/external/nemo-platform:2.0.1'
This produces nemo-platform-2.0.1.tar.gz in the current directory.
Review the default values in the NeMo Platform Helm Chart reference. To override the default values, create a custom values file. Review the following while creating your custom values file.
To configure an external database, see Database Setup.
To configure persistent volumes for job and file storage, see Persistent Volumes.
To configure file storage options, see File Storage.
To configure ingress, see Ingress.
To configure multi-node networking, see Multinode Networking.
To configure OpenShift-compatible security context overrides, see OpenShift.
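Before writing overrides, it can help to dump the chart's bundled defaults to a file and review them side by side with the reference. The following is a minimal sketch, assuming the tarball name from the download step; the function name `dump_default_values` and the output file name `default-values.yaml` are illustrative and not part of the chart.

```shell
# Write the chart's default values to a file for review.
# $1 = chart tarball, $2 = output file (both defaults are illustrative).
dump_default_values() {
  local chart="${1:-nemo-platform-2.0.1.tar.gz}"
  local out="${2:-default-values.yaml}"
  if [ ! -f "$chart" ]; then
    echo "chart tarball not found: $chart" >&2
    return 1
  fi
  # `helm show values` prints the values.yaml packaged inside the chart.
  helm show values "$chart" > "$out"
}
```

Running `dump_default_values` with no arguments in the download directory writes `default-values.yaml`; copy only the keys you intend to change into your own values file.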
Install the Volcano scheduler before installing the chart. This is required for customization jobs that leverage multiple nodes.
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.9.0/installer/volcano-development.yaml
After applying, wait for the Volcano admission webhook to finish initializing before proceeding. The webhook registers immediately with failurePolicy: Fail, but TLS certificate generation runs asynchronously. If you proceed before the webhook is ready, all pod creation, including the NeMo Platform Helm install, will fail with a certificate error.
kubectl wait --for=condition=complete job/volcano-admission-init -n volcano-system --timeout=120s
kubectl rollout status deployment/volcano-admission -n volcano-system
Install the chart from the downloaded tarball, optionally overriding values with your values.yaml:
helm upgrade --install nemo-platform \
  --namespace $NAMESPACE \
  nemo-platform-2.0.1.tar.gz \
  -f values.yaml
Installation takes approximately 10 minutes, covering image downloads, container startup, and communication establishment. The actual time depends on the speed of your network connection. Pods might appear in pending or restarting states while the installation is in progress.
Note
The early access NGC org distributes platform images through the private registry. Services may attempt to pull task images (for example, customizer-automodel, customizer-rl, customizer-tasks, nmp-cpu-tasks, nmp-gpu-tasks) on demand. If those pulls fail, see Pre-pull task images for the early access org below.
Verify the pod status:
kubectl get pods -n $NAMESPACE
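Because pods can sit in pending or restarting states for several minutes, a polling helper may be more convenient than re-running the command by hand. The sketch below is one possible approach, not part of the platform tooling: the function names, the 900-second default timeout, and the 15-second interval are assumptions. It treats a pod as settled when its STATUS column reads Running or Completed, which does not by itself guarantee that every container is ready.

```shell
# Count pods that are neither Running nor Completed in a namespace.
# Reads the STATUS column (3rd field) of `kubectl get pods --no-headers`.
count_unsettled_pods() {
  local ns="$1"
  kubectl get pods -n "$ns" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Poll until every pod has settled or the timeout (in seconds) expires.
# $1 = namespace, $2 = timeout, $3 = polling interval (illustrative defaults).
wait_for_pods() {
  local ns="$1" timeout="${2:-900}" interval="${3:-15}" waited=0
  while [ "$(count_unsettled_pods "$ns")" -gt 0 ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out; pods still pending or restarting in $ns" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

For example, `wait_for_pods "$NAMESPACE"` returns once no pod reports a status other than Running or Completed, or fails after the timeout so that a calling script can stop.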
Pre-pull task images for the early access org
Some services launch task containers (Customizer, Safe Synthesizer, Evaluator, Data Designer) on demand. The runtime cannot pull these from the early access private registry on the fly, so make them available on each node ahead of time.
Pull the images once with your NGC key, then either push them to a registry your cluster can read or load them onto each node. Skip any image whose service you do not plan to use.
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin
docker pull nvcr.io/0857255566152269/external/customizer-automodel:26.03.1
docker pull nvcr.io/0857255566152269/external/customizer-rl:26.03.1
docker pull nvcr.io/0857255566152269/external/customizer-tasks:26.03.1
docker pull nvcr.io/0857255566152269/external/nmp-cpu-tasks:26.03.1
docker pull nvcr.io/0857255566152269/external/nmp-gpu-tasks:26.03.1
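If you choose to push the images to a registry your cluster can read, the pull, retag, and push steps can be wrapped in a loop. The following is a sketch under stated assumptions: the function name `mirror_task_images` and the target prefix `registry.example.com/nemo` are hypothetical, you are already logged in to both registries, and the source path and tag match the pull commands above.

```shell
# Pull each task image from nvcr.io, retag it, and push it to another registry.
# $1 = target registry prefix (hypothetical, e.g. registry.example.com/nemo)
# remaining arguments = image names to mirror
mirror_task_images() {
  local target="$1"; shift
  local src="nvcr.io/0857255566152269/external" tag="26.03.1" name
  for name in "$@"; do
    docker pull "$src/$name:$tag" || return 1
    docker tag "$src/$name:$tag" "$target/$name:$tag" || return 1
    docker push "$target/$name:$tag" || return 1
  done
}
```

A call such as `mirror_task_images registry.example.com/nemo customizer-tasks nmp-cpu-tasks` mirrors only the images for the services you plan to use. Pointing the chart at the mirrored location depends on its values; check the NeMo Platform Helm Chart reference for the relevant image settings.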
If a service later fails with 403 Forbidden from nvcr.io while pulling an image (for example, an embedding NIM under nvcr.io/nim/...), the same authentication scope cannot fetch that image from the private registry. Either pre-pull it as above, or use a non-NIM model deployment.
For Red Hat OpenShift, use OpenShift-compatible security context overrides so pods satisfy the restricted SCC. See OpenShift.
To upgrade the deployment with new configurations, use the following command:
helm upgrade nemo-platform nemo-platform-2.0.1.tar.gz --namespace $NAMESPACE -f values.yaml
To uninstall the deployment, use the following command:
helm uninstall nemo-platform --namespace $NAMESPACE
Note
helm uninstall intentionally does not remove all resources:
PVCs are preserved to prevent accidental data loss. Delete them manually if no longer needed.
CRDs are not removed by Helm design (upstream issue) to avoid destroying custom resources across the cluster.
Completed jobs, secrets, and the namespace may also remain.
If you need a complete teardown (e.g., for CI/CD pipelines or reinstalling in the same namespace), run the following after helm uninstall:
# Delete the namespace — removes all namespace-scoped resources (pods, PVCs, jobs, secrets, etc.)
kubectl delete namespace <namespace>
# Then delete CRDs if needed; this is generally not required
kubectl delete crd <crd_name>
Warning
Deleting CRDs removes all custom resources of those types cluster-wide. Only do this if no other workloads depend on them.
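For CI/CD use, the uninstall and optional namespace deletion can be combined into one helper that shows what is left behind before anything is deleted. This is a minimal sketch: the function name `teardown` and the `--delete-namespace` argument are inventions for this example, not Helm or kubectl flags, and CRDs are deliberately left untouched.

```shell
# Uninstall a release, list what remains, and optionally delete the namespace.
# $1 = release name, $2 = namespace, $3 = "--delete-namespace" to opt in
# (the third argument is a convention of this sketch, not a Helm flag).
teardown() {
  local release="$1" ns="$2" confirm="${3:-}"
  helm uninstall "$release" -n "$ns" || return 1
  echo "Resources still present in $ns:"
  kubectl get pvc,jobs,secrets -n "$ns"
  if [ "$confirm" = "--delete-namespace" ]; then
    # Removes all namespace-scoped resources, including preserved PVCs.
    kubectl delete namespace "$ns"
  else
    echo "Namespace $ns kept; pass --delete-namespace to remove it as well." >&2
  fi
}
```

Calling `teardown nemo-platform "$NAMESPACE"` keeps the PVCs and namespace in place, while adding `--delete-namespace` performs the complete teardown described above.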
Troubleshooting
Volcano admission webhook blocks pod creation
Symptom: Pod creation fails cluster-wide with an error like:
Internal error occurred: failed calling webhook "mutatepod.volcano.sh":
failed to call webhook: Post "https://volcano-admission-service.volcano-system.svc:443/pods/mutate?timeout=10s":
tls: failed to verify certificate: x509: certificate signed by unknown authority
Cause: The Volcano MutatingWebhookConfiguration registers with failurePolicy: Fail before the volcano-admission-init job finishes generating TLS certificates. This affects all namespaces, not just Volcano workloads.
Fix: Wait for the webhook to become functional, then restart the admission deployment to force certificate regeneration:
kubectl rollout restart deployment/volcano-admission -n volcano-system
kubectl rollout status deployment/volcano-admission -n volcano-system
Verify the webhook is accepting requests before retrying your Helm install:
until kubectl run volcano-webhook-test --image=busybox --restart=Never --dry-run=server -o yaml 2>/dev/null; do
  echo "Volcano webhook not ready yet, waiting..."
  sleep 5
done
kubectl delete pod volcano-webhook-test --ignore-not-found=true
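The loop above retries indefinitely, which can hang an automated pipeline if the webhook never recovers. A bounded variant is sketched below; the function name `wait_for_webhook` and the defaults of 24 attempts at 5-second intervals are assumptions chosen for illustration. It uses the same server-side dry run, so no pod is actually created.

```shell
# Retry a server-side dry-run pod creation until the webhook accepts it,
# giving up after a fixed number of attempts.
# $1 = maximum attempts, $2 = delay between attempts in seconds.
wait_for_webhook() {
  local attempts="${1:-24}" delay="${2:-5}" i=0
  while [ "$i" -lt "$attempts" ]; do
    if kubectl run volcano-webhook-test --image=busybox --restart=Never \
         --dry-run=server -o name >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Volcano webhook still rejecting requests after $attempts attempts" >&2
  return 1
}
```

In a script, `wait_for_webhook || exit 1` placed before the Helm install stops the run with a clear message instead of waiting forever.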