NeMo Microservices Prerequisites#
NeMo microservices have some shared prerequisites to configure on your Kubernetes cluster before deploying a NeMo microservice with the NVIDIA NIM Operator. The prerequisites include:
Creating a namespace to deploy your NeMo microservices.
Creating image pull secrets that contain your NGC API key.
Optionally, deploying NeMo dependencies with Ansible. Each NeMo microservice has per-component dependencies in addition to the prerequisites on this page. The NIM Operator team maintains Ansible playbooks to help you quickly deploy these microservice-specific dependencies to test the NeMo microservices.
Optionally, if you are installing NeMo Customizer, installing the NeMo Operator, which provides custom resources that help manage NeMo Customizer jobs.
After configuring the prerequisites on this page, you can continue to the NeMo microservices deployment guide or deploy each microservice individually by reviewing the microservice-specific page.
1. Create NeMo Namespace#
It’s recommended to deploy all NeMo microservices in a single namespace:
$ kubectl create namespace nemo
You should also consider applying a Kubernetes resource quota to your NeMo namespace. If you deploy using the Ansible NeMo Dependency playbook, the playbook creates a default `nemo` namespace for you.
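If you do apply a resource quota, a minimal sketch might look like the following. The quota name and all limit values here are illustrative placeholders; size them for your cluster and workload.

```yaml
# Sketch only: the name and all limits below are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: nemo-quota
  namespace: nemo
spec:
  hard:
    requests.cpu: "64"            # total CPU requests allowed in the namespace
    requests.memory: 256Gi        # total memory requests
    requests.nvidia.com/gpu: "8"  # total GPU requests (extended resource)
    persistentvolumeclaims: "20"  # cap on PVC count
```

Apply it with `kubectl apply -f <file>` after creating the namespace.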
2. Create NGC API Key Image Pull Secrets#
You must create the required image pull secrets in your `nemo` namespace to be able to pull NeMo microservice images from NVIDIA NGC.
Refer to Generating Your NGC API Key in the NVIDIA NGC User Guide for more information.
Add a Docker registry secret for downloading container images from NVIDIA NGC:
$ kubectl create secret -n nemo docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=<ngc-api-key>
Add a generic secret that the model puller containers use to download models from NVIDIA NGC:
$ kubectl create secret -n nemo generic ngc-api-secret \
    --from-literal=NGC_API_KEY=<ngc-api-key>
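If you manage cluster state declaratively, the same generic secret can be expressed as a manifest. This is a sketch; `<ngc-api-key>` still needs to be filled in, and `stringData` lets Kubernetes handle the base64 encoding for you:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ngc-api-secret
  namespace: nemo
type: Opaque
stringData:
  NGC_API_KEY: <ngc-api-key>  # replace with your NGC API key
```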
3. Deploy NeMo Dependencies with Ansible#
Each NeMo microservice relies on several dependencies, for things like databases or Kubernetes secrets. The NIM Operator team maintains Ansible playbooks to help you quickly install most dependencies on your cluster.
Note
While these playbooks are helpful for testing, they may not be suitable for production environments. Refer to each microservice’s configuration page for specific dependency requirements.
Using this playbook, you can choose to install dependencies for one or more of the following NeMo microservices:

NeMo Microservice | Dependencies Deployed with Ansible | Additional Dependencies You Create
---|---|---
NeMo Data Store | PostgreSQL, MinIO, Kubernetes storage provisioner*, Kubernetes secrets (database user, object storage user, Data Store default) | Image pull secret
NeMo Entity Store | PostgreSQL, Kubernetes storage provisioner*, Kubernetes secrets (database user) | Image pull secret, NeMo Data Store, NeMo Entity Store
NeMo Evaluator | PostgreSQL, Kubernetes secrets (database user), Argo Workflows, Milvus | Image pull secret, NeMo Data Store, NeMo Entity Store
NeMo Customizer | PostgreSQL, Kubernetes storage provisioner*, Kubernetes secrets (database user, W&B API Key), Kubernetes ConfigMap for training and model downloads, OpenTelemetry, Volcano Scheduler | Image pull secret, NeMo Data Store, NeMo Entity Store, NeMo Operator
NeMo Guardrails | Kubernetes storage provisioner*, NIM Endpoint as a NIMPipeline | Image pull secret
Note
*By default, the NeMo Dependency playbook installs Local Path Provisioner to use as the default StorageClass. You can configure a different StorageClass by specifying your desired `pvc` settings, described in more detail below.
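For reference, a PVC that consumes the playbook's default `local-path` StorageClass would look something like this sketch; the claim name and size are hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-nemo-pvc     # hypothetical name
  namespace: nemo
spec:
  storageClassName: local-path  # default StorageClass installed by the playbook
  accessModes:
    - ReadWriteOnce             # Local Path Provisioner supports ReadWriteOnce only
  resources:
    requests:
      storage: 10Gi             # placeholder size
```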
Install Dependencies#
Clone the NIM Operator repository:
$ git clone https://github.com/NVIDIA/k8s-nim-operator.git
Navigate to the `nemo-dependencies` directory:
$ cd k8s-nim-operator/test/e2e/nemo-dependencies
Configure the playbook to install all dependencies by setting each microservice to `yes` in the `values.yaml`:

install:
  customizer: yes
  datastore: yes
  entity_store: yes
  evaluator: yes
  jupyter: yes  # Deploys a Jupyter server to use with the Jupyter notebook tutorial. Change to `no` if you don't want this deployed.

uninstall:
  customizer: yes
  datastore: yes
  entity_store: yes
  evaluator: yes
  jupyter: yes

installation_namespace: nemo

# Specify a custom storage class and volume access mode for all PVCs,
# for example ReadWriteMany for an NFS storage class.
pvc:
  # Ignored when localPathProvisioner.enabled is true; in that case these
  # default to storage_class: "local-path" and volume_access_mode: "ReadWriteOnce".
  storage_class: ""
  volume_access_mode: ReadWriteOnce

# Deploy a local-path CSI provisioner.
localPathProvisioner:
  # Disable this when a different CSI provisioner is already deployed in the cluster.
  enabled: true
  default: true
  version: v0.0.31
Additional configuration options:
Change the namespace where all NeMo dependencies are deployed. By default, the playbook creates and deploys services into the `nemo` namespace. You can change this by updating `installation_namespace`.
By default, the playbook deploys Local Path Provisioner to use as the default StorageClass. If a default StorageClass is already provisioned in the cluster, set `localPathProvisioner.enabled: false`, then specify a custom StorageClass and volume access mode for the playbook to use for all PVCs, for example `ReadWriteMany` for an NFS StorageClass.
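For example, to reuse an existing NFS-backed StorageClass instead of Local Path Provisioner, the relevant `values.yaml` fragment might look like the following sketch. The `nfs-client` class name is a placeholder for whatever your cluster actually provides:

```yaml
localPathProvisioner:
  enabled: false              # a default CSI provisioner already exists in the cluster

pvc:
  storage_class: "nfs-client" # placeholder: use your cluster's NFS StorageClass name
  volume_access_mode: ReadWriteMany
```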
Refer to the NeMo dependencies documentation for full details on available configuration options.
Run the Ansible playbook:
$ ansible-playbook -c local -i localhost install.yaml
The playbook will take several minutes to complete.
Verify Dependencies#
Check that pods are running:
$ kubectl get pods -n nemo
Example output
NAME                                                       READY   STATUS      RESTARTS   AGE
argo-workflows-server-85d8489c58-l5fnc                     1/1     Running     0          10m
argo-workflows-workflow-controller-698f7bb767-dhr9l        1/1     Running     0          10m
customizer-otel-opentelemetry-collector-7ff98567c5-slg8v   1/1     Running     0          10m
customizer-pg-postgresql-0                                 1/1     Running     0          10m
datastore-pg-postgresql-0                                  1/1     Running     0          10m
entity-store-pg-postgresql-0                               1/1     Running     0          10m
evaluator-otel-opentelemetry-collector-6cf75b448-f2m6h     1/1     Running     0          10m
evaluator-pg-postgresql-0                                  1/1     Running     0          10m
jupyter-notebook-f4cdbc988-cgc8z                           1/1     Running     0          10m
meta-llama3-1b-instruct-5cbd55b49b-nrt9j                   1/1     Running     0          10m
milvus-standalone-8fbb48495-dfrhr                          1/1     Running     0          10m
mlflow-minio-568b6bc597-nx647                              1/1     Running     0          10m
mlflow-minio-provisioning-vfvr9                            0/1     Completed   0          10m
mlflow-postgresql-0                                        1/1     Running     0          10m
mlflow-tracking-6fbc46b567-6q6n8                           1/1     Running     0          10m
volcano-admission-5c5c96b944-pfltc                         1/1     Running     0          10m
volcano-admission-init-stkhm                               0/1     Completed   0          10m
volcano-controllers-699b864756-rnvqb                       1/1     Running     0          10m
volcano-scheduler-5f77fc8fb9-cjqht                         1/1     Running     0          10m
Verify the Kubernetes secrets for the dependencies have been created:
$ kubectl get secrets -n nemo
View all the NIM microservices:
$ kubectl get -n nemo nimpipeline,nimcache,nimservice
Example output
NAME                                             STATUS   AGE
nimpipeline.apps.nvidia.com/llama3-1b-pipeline   Ready    40m

NAME                                               STATUS   PVC                           AGE
nimcache.apps.nvidia.com/meta-llama3-1b-instruct   Ready    meta-llama3-1b-instruct-pvc   40m

NAME                                                 STATUS   AGE
nimservice.apps.nvidia.com/meta-llama3-1b-instruct   Ready    40m
4. Install NeMo Operator#
When you want to run a NeMo Customizer workflow, you must install the NeMo Operator microservice. This microservice manages custom resources for LLM training workloads in NeMo Customizer jobs. It does not manage any NeMo microservice. Refer to the NeMo microservices documentation for details about the NeMo Operator and the training CRDs it manages.
Prerequisites#
Create the image pull secrets in the `nemo` namespace, or the namespace where you plan to install the NeMo Operator. You also need to pass your NVIDIA NGC API key to fetch the NeMo Operator Helm chart.
Access to an NFS-backed Persistent Volume that supports the `ReadWriteMany` access mode. The NeMo Operator microservice dynamically provisions NFS-backed persistent volumes using Kubernetes storage classes.
Install the Volcano scheduler on your cluster. Use the NeMo Dependency Ansible playbooks to install Volcano, or refer to the Volcano install documentation.
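If your cluster does not already expose an NFS-backed StorageClass, one common option is the Kubernetes NFS CSI driver (csi-driver-nfs). A StorageClass for it might look like the sketch below; the class name, server, and share values are placeholders, and the driver itself must already be installed:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi              # placeholder name
provisioner: nfs.csi.k8s.io  # assumes the NFS CSI driver is installed
parameters:
  server: nfs.example.com    # placeholder: your NFS server
  share: /exports            # placeholder: your exported path
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

PVCs that request this class with the `ReadWriteMany` access mode can then be provisioned dynamically.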
It's recommended that you use the latest version of the NeMo Operator. View available NeMo Operator versions on NVIDIA NGC and update the `VERSION` variable to pull your desired version.
$ export VERSION=25.4.0
Install the NeMo Operator with Helm#
Fetch the NeMo Operator Helm chart.
$ helm fetch https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nemo-operator-${VERSION}.tgz --username='$oauthtoken' --password=<YOUR NGC API Key>
Install the NeMo Operator.
$ helm upgrade --install nemo-operator nemo-operator-${VERSION}.tgz -n nemo \
    --set imagePullSecrets[0].name=ngc-secret \
    --set controllerManager.manager.scheduler=volcano
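The two `--set` flags above can equivalently live in a small values file, which is easier to keep in version control. Helm maps `--set a.b=c` to the same nested structure, so a sketch of that file (the filename is an assumption) would be:

```yaml
# nemo-operator-values.yaml (assumed filename)
imagePullSecrets:
  - name: ngc-secret
controllerManager:
  manager:
    scheduler: volcano
```

Pass it with `helm upgrade --install nemo-operator nemo-operator-${VERSION}.tgz -n nemo -f nemo-operator-values.yaml`.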
Verify the NeMo Operator was installed.
$ kubectl get pods -n nemo | grep "nemo-operator"