NeMo Microservices Prerequisites#

NeMo microservices share several prerequisites that you must configure on your Kubernetes cluster before deploying a NeMo microservice with the NVIDIA NIM Operator. The prerequisites include:

  1. Creating a namespace to deploy your NeMo microservices.

  2. Creating image pull secrets that contain your NGC API key.

  3. Optionally, deploying NeMo dependencies with Ansible. Each NeMo microservice has per-component dependencies in addition to the prerequisites on this page. The NIM Operator maintains Ansible playbooks that help you quickly deploy these microservice-specific dependencies so that you can test the NeMo microservices.

  4. Optionally, if you are installing NeMo Customizer, installing the NeMo Operator, which provides custom resources that help manage NeMo Customizer jobs.

After configuring the prerequisites on this page, you can continue to the NeMo microservices deployment guide or deploy each microservice individually by reviewing the microservice-specific page.

1. Create NeMo Namespace#

It’s recommended to deploy all NeMo microservices in a single namespace:

$ kubectl create namespace nemo

You should also consider applying a Kubernetes resource quota to your NeMo namespace. If you deploy using the Ansible NeMo Dependency playbook, the playbook creates a default nemo namespace for you.
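As a sketch, a resource quota for the nemo namespace might look like the following. The quota name and all limit values are illustrative placeholders, not recommended sizes; adjust them to your cluster capacity and workload requirements:

```yaml
# Hypothetical example: cap aggregate resource usage in the nemo namespace.
# All values below are placeholders -- tune them for your cluster.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: nemo-quota        # placeholder name
  namespace: nemo
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    limits.cpu: "128"
    limits.memory: 512Gi
    requests.nvidia.com/gpu: "8"
```

Apply it with `kubectl apply -f <file>` after creating the namespace.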

2. Create NGC API Key Image Pull Secrets#

You must create the required image pull secrets in your nemo namespace so that you can pull NeMo microservice images from NVIDIA NGC. Refer to Generating Your NGC API Key in the NVIDIA NGC User Guide for more information.

  1. Add a Docker registry secret for downloading container images from NVIDIA NGC:

    $ kubectl create secret -n nemo docker-registry ngc-secret \
        --docker-server=nvcr.io \
        --docker-username='$oauthtoken' \
        --docker-password=<ngc-api-key>
    
  2. Add a generic secret that the model puller containers use to download models from NVIDIA NGC:

    $ kubectl create secret -n nemo generic ngc-api-secret \
        --from-literal=NGC_API_KEY=<ngc-api-key>
    
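To illustrate how these two secrets are consumed (a hedged sketch, not taken from the NIM Operator manifests — the pod name and image are placeholders), a pod that pulls an image from nvcr.io and reads the API key would reference them like this:

```yaml
# Illustrative pod spec: only ngc-secret and ngc-api-secret come from the steps above;
# everything else is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: ngc-pull-example          # placeholder name
  namespace: nemo
spec:
  imagePullSecrets:
    - name: ngc-secret            # docker-registry secret created in step 1
  containers:
    - name: app
      image: nvcr.io/nvidia/example:latest   # placeholder image
      env:
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: ngc-api-secret   # generic secret created in step 2
              key: NGC_API_KEY
```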

3. Deploy NeMo Dependencies with Ansible#

Each NeMo microservice relies on several dependencies, such as databases and Kubernetes secrets. The NIM Operator team maintains Ansible playbooks to help you quickly install most dependencies on your cluster.

Note

While these playbooks are helpful for testing, they may not be suitable for production environments. Refer to each microservice’s configuration page for specific dependency requirements.

Using this playbook, you can choose to install dependencies for one or more of the following NeMo microservices:

| NeMo Microservice | Dependencies Deployed with Ansible | Additional Dependencies You Create |
| --- | --- | --- |
| NeMo Data Store | PostgreSQL, MinIO, Kubernetes storage provisioner*, Kubernetes secrets (database user, object storage user, Data Store default) | Image pull secret |
| NeMo Entity Store | PostgreSQL, Kubernetes storage provisioner*, Kubernetes secrets (database user) | Image pull secret, NeMo Data Store, NeMo Entity Store |
| NeMo Evaluator | PostgreSQL, Kubernetes secrets (database user), Argo Workflows, Milvus | Image pull secret, NeMo Data Store, NeMo Entity Store |
| NeMo Customizer | PostgreSQL, Kubernetes storage provisioner*, Kubernetes secrets (database user, W&B API key), Kubernetes ConfigMap for training and model downloads, OpenTelemetry, Volcano Scheduler | Image pull secret, NeMo Data Store, NeMo Entity Store, NeMo Operator |
| NeMo Guardrails | Kubernetes storage provisioner*, NIM endpoint as a NIMPipeline | Image pull secret |

Note

*By default, the NeMo Dependency playbook installs Local Path Provisioner as the default StorageClass. You can configure a different StorageClass by specifying your desired pvc settings, described in more detail below.

Install Dependencies#

  1. Clone the NIM Operator repository:

    $ git clone https://github.com/NVIDIA/k8s-nim-operator.git
    
  2. Navigate to the nemo-dependencies directory:

    $ cd k8s-nim-operator/test/e2e/nemo-dependencies
    
  3. Configure the playbook to install all dependencies by setting each microservice to yes in values.yaml:

    install:
      customizer: yes
      datastore: yes
      entity_store: yes
      evaluator: yes
      jupyter: yes # Deploys a Jupyter server to use with the Jupyter notebook tutorial. Change to `no` if you don't want it deployed.
    
    uninstall:
      customizer: yes
      datastore: yes
      entity_store: yes
      evaluator: yes
      jupyter: yes 
    
    installation_namespace: nemo
    
    # Specify a custom storage class and volume access mode for all PVCs, e.g. ReadWriteMany for an NFS storage class
    pvc:
      # Ignored when localPathProvisioner.enabled is true — in which case these will default to:
      # storage_class: "local-path" and volume_access_mode: "ReadWriteOnce"
      storage_class: ""
      volume_access_mode: ReadWriteOnce
    
    # Deploy a local-path CSI provisioner
    localPathProvisioner:
      # disable it when a different CSI provisioner is already deployed in the cluster
      enabled: true
      default: true
      version: v0.0.31
    

    Additional configuration options:

    • Change the namespace to deploy all NeMo dependencies.
      By default, the playbook creates and deploys services into the nemo namespace. You can change this by updating the installation_namespace.

    • By default, the playbook deploys Local Path Provisioner as the default StorageClass. If a default StorageClass is already provisioned in the cluster, set localPathProvisioner.enabled: false, then specify the custom StorageClass and volume access mode that the playbook should use for all PVCs, for example ReadWriteMany for an NFS storage class.

    Refer to the NeMo dependencies documentation for full details on available configuration options.
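    For example, if your cluster already has an NFS-backed StorageClass, the relevant values.yaml overrides might look like the following sketch (the storage class name nfs-client is a placeholder; substitute your own):

```yaml
# Use an existing NFS StorageClass instead of deploying Local Path Provisioner.
pvc:
  storage_class: "nfs-client"       # placeholder: substitute your StorageClass name
  volume_access_mode: ReadWriteMany # NFS supports shared read-write access

localPathProvisioner:
  enabled: false   # skip local-path: a CSI provisioner is already deployed
```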

  4. Run the Ansible playbook:

    $ ansible-playbook -c local -i localhost install.yaml
    

    The playbook takes several minutes to complete.

Verify Dependencies#

  1. Check that pods are running:

    $ kubectl get pods -n nemo
    

    Example output

    NAME                                                      READY   STATUS      RESTARTS   AGE
    argo-workflows-server-85d8489c58-l5fnc                    1/1     Running     0          10m
    argo-workflows-workflow-controller-698f7bb767-dhr9l       1/1     Running     0          10m
    customizer-otel-opentelemetry-collector-7ff98567c5-slg8v  1/1     Running     0          10m
    customizer-pg-postgresql-0                                1/1     Running     0          10m
    datastore-pg-postgresql-0                                 1/1     Running     0          10m
    entity-store-pg-postgresql-0                              1/1     Running     0          10m
    evaluator-otel-opentelemetry-collector-6cf75b448-f2m6h    1/1     Running     0          10m
    evaluator-pg-postgresql-0                                 1/1     Running     0          10m
    jupyter-notebook-f4cdbc988-cgc8z                          1/1     Running     0          10m
    meta-llama3-1b-instruct-5cbd55b49b-nrt9j                  1/1     Running     0          10m
    milvus-standalone-8fbb48495-dfrhr                         1/1     Running     0          10m
    mlflow-minio-568b6bc597-nx647                             1/1     Running     0          10m
    mlflow-minio-provisioning-vfvr9                           0/1     Completed   0          10m
    mlflow-postgresql-0                                       1/1     Running     0          10m
    mlflow-tracking-6fbc46b567-6q6n8                          1/1     Running     0          10m
    volcano-admission-5c5c96b944-pfltc                        1/1     Running     0          10m
    volcano-admission-init-stkhm                              0/1     Completed   0          10m
    volcano-controllers-699b864756-rnvqb                      1/1     Running     0          10m
    volcano-scheduler-5f77fc8fb9-cjqht                        1/1     Running     0          10m
    
  2. Verify the Kubernetes secrets for the dependencies have been created:

    $ kubectl get secrets -n nemo
    
  3. View all the NIM microservices:

    $ kubectl get -n nemo nimpipeline,nimcache,nimservice
    

    Example output

    NAME                                                 STATUS   AGE
    nimpipeline.apps.nvidia.com/llama3-1b-pipeline       Ready    40m
    
    NAME                                                 STATUS   PVC                           AGE
    nimcache.apps.nvidia.com/meta-llama3-1b-instruct     Ready    meta-llama3-1b-instruct-pvc   40m
    
    NAME                                                 STATUS   AGE
    nimservice.apps.nvidia.com/meta-llama3-1b-instruct   Ready    40m
    

4. Install NeMo Operator#

When you want to run a NeMo Customizer workflow, you must install the NeMo Operator microservice. This microservice manages custom resources for the LLM training workloads of NeMo Customizer jobs. It does not manage any NeMo microservice. Refer to the NeMo microservices documentation for details about the NeMo Operator and the training CRDs it manages.

Prerequisites#

  • Create the image pull secrets in the nemo namespace, or in the namespace where you plan to install the NeMo Operator. You also need your NVIDIA NGC API key to fetch the NeMo Operator Helm chart.

  • Access to an NFS-backed Persistent Volume that supports ReadWriteMany access mode. The NeMo Operator microservice dynamically provisions NFS-backed persistent volumes using Kubernetes storage classes.

  • Install Volcano scheduler on your cluster. Use the NeMo Dependency Ansible playbooks to install Volcano, or refer to the Volcano install documentation.

  • It's recommended that you use the latest version of the NeMo Operator. View the available NeMo Operator versions on NVIDIA NGC and update the VERSION variable to pull your desired version.

    $ export VERSION=25.4.0
    

Install the NeMo Operator with Helm#

  1. Fetch the NeMo Operator Helm chart.

    $ helm fetch https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nemo-operator-${VERSION}.tgz --username='$oauthtoken' --password=<YOUR NGC API Key>
    
  2. Install the NeMo Operator.

    $ helm upgrade --install nemo-operator nemo-operator-${VERSION}.tgz -n nemo --set imagePullSecrets[0].name=ngc-secret --set controllerManager.manager.scheduler=volcano 
    
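     As an alternative to passing --set flags, the same overrides can be kept in a values file. This is a sketch whose key layout simply mirrors the flags in the command above; the file name is a placeholder:

```yaml
# nemo-operator-values.yaml -- equivalent to the --set flags in the install command
imagePullSecrets:
  - name: ngc-secret        # docker-registry secret created earlier
controllerManager:
  manager:
    scheduler: volcano      # schedule training jobs with Volcano
```

     Then install with `helm upgrade --install nemo-operator nemo-operator-${VERSION}.tgz -n nemo -f nemo-operator-values.yaml`.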
  3. Verify the NeMo Operator was installed.

    $ kubectl get pods -n nemo | grep "nemo-operator"
    

Next Steps#