NeMo Customizer Microservice Deployment Guide#

NeMo Customizer is a lightweight API server that runs managed training jobs on GPU nodes using the Volcano scheduler.

Prerequisites#

Before installing NeMo Customizer, make sure that you have all of the following:

Minimum System Requirements

  • A single-node Kubernetes cluster on a Linux host and cluster-admin level permissions.

  • At least 200 GB of free disk space.

  • At least one dedicated GPU (A100 80 GB or H100 80 GB).

Storage

Kubernetes


Values Setup for Installing NeMo Customizer#

If you want to install NeMo Customizer as a standalone microservice, you need to configure the following value overrides in the values.yaml file.

tags:
  platform: false
  customizer: true

NeMo Customizer requires the following NeMo microservices to be installed:


Multi-node Training#

To enable multi-node training on cloud providers, install Kyverno as a dependency.

  1. Add the Kyverno Helm repository.

helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update

  2. Install Kyverno.

helm upgrade -i kyverno kyverno/kyverno -n kyverno --create-namespace --version 3.1.5

Multi-node training on AWS with EFA#

  1. Define your initial values.yaml file.

  2. Install NeMo Customizer with awsDeploy.enabled=true:

    helm --namespace nemo-customizer install nemo-customizer \
       nemo-microservices-helm-chart \
       -f <path-to-your-values-file> \
       --set awsDeploy.enabled=true
    
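If you prefer a values file over command-line flags, the same AWS settings can be captured in one place. This is a sketch; the key path mirrors the --set flag above:

```yaml
# aws-values.yaml -- sketch combining the standalone-install tags with the AWS flag
tags:
  platform: false
  customizer: true

awsDeploy:
  enabled: true
```

Then pass it to the install command with -f aws-values.yaml instead of the --set flag.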

Multi-node training on Azure#

  1. Define your initial values.yaml file.

  2. Install NeMo Customizer with azureDeploy.enabled=true:

    helm --namespace nemo-customizer install nemo-customizer \
       nemo-microservices-helm-chart \
       -f <path-to-your-values-file> \
       --set azureDeploy.enabled=true
    

Multi-node training on GCP#

  1. Install the NeMo platform Helm chart with GCP-specific configurations:

    helm --namespace default install \
       nemo nmp/nemo-microservices-helm-chart \
       --set tags.platform=false \
       --set tags.customizer=true \
       --set gcpDeploy.enabled=true \
       --set customizer.customizerConfigs.training.pvc.storageClass=<YOUR_STORAGE_CLASS>
    

    Note

    Ensure that the NIM inference image tag is later than 1.8.3. If you use a NIM image tag <= 1.8.3, you must also provide the following environment variable to NIM: LD_LIBRARY_PATH=/usr/local/nvidia/lib64.

    helm --namespace default install \
       nemo nmp/nemo-microservices-helm-chart \
       --set tags.platform=false \
       --set tags.customizer=true \
       --set gcpDeploy.enabled=true \
       --set nim.env[0].name=NIM_PEFT_SOURCE \
       --set nim.env[0].value=http://nemo-entity-store:8000 \
       --set nim.env[1].name=NIM_PEFT_REFRESH_INTERVAL \
       --set nim.env[1].value="30" \
       --set nim.env[2].name=NIM_MAX_CPU_LORAS \
       --set nim.env[2].value="16" \
       --set nim.env[3].name=NIM_MAX_GPU_LORAS \
       --set nim.env[3].value="8" \
       --set nim.env[4].name=LD_LIBRARY_PATH \
       --set nim.env[4].value=/usr/local/nvidia/lib64 \
       --set customizer.customizerConfigs.training.pvc.storageClass=<YOUR_STORAGE_CLASS>
    

    You can also create a gcp-values.yaml file with the following configuration:

    tags:
      platform: false
      customizer: true
    
    gcpDeploy:
      enabled: true
    
    customizer:
      customizerConfigs:
        training:
          pvc:
            # Replace <YOUR_STORAGE_CLASS> with an appropriate value
            storageClass: <YOUR_STORAGE_CLASS>
    

    Then install using:

    helm --namespace default install \
       nemo nmp/nemo-microservices-helm-chart \
       -f gcp-values.yaml
    

    Note

    Replace <YOUR_STORAGE_CLASS> with your actual GCP storage class name, such as standard-rwo or premium-rwo.

Multi-node training on OCI#

  1. Define your initial values.yaml file.

  2. Install NeMo Customizer with ociDeploy.enabled=true:

    helm --namespace nemo-customizer install nemo-customizer \
       nemo-microservices-helm-chart \
       -f <path-to-your-values-file> \
       --set ociDeploy.enabled=true
    

Support Matrix#

| Cloud Provider | High-Performance Networking | Details | Tested Environment |
|---|---|---|---|
| AWS | EFA | Managed through Kyverno | p5.48xlarge, EFA-supported GPU instances |
| Azure | InfiniBand (RDMA) | Managed through Kyverno | Standard_ND96amsr_A100_v4 |
| GCP | TCP-X, TCP-XO | Managed through Kyverno | a3-megagpu-8g with NVIDIA H100 80GB MEGA |
| OCI | RDMA (RoCE) | Managed through Kyverno | BM.GPU.A100 |

Verifying policy setup#

To check Kyverno policy application:

kubectl get policy
kubectl describe policy customizer-eks-efa-configs  # For AWS
kubectl describe policy customizer-azure-rdma-nccl-configs  # For Azure
kubectl describe policy customizer-gcp-tcpxo-nccl-configs  # For GCP
kubectl describe policy customizer-oci-rdma-nccl-configs # For OCI

Configure Features#

NeMo Customizer uses several services that you can deploy independently or test with the default subcharts and values.

Queue Executor#

You have two options for the queue executor: Volcano and Run:AI.

Volcano#

Install Volcano#

Install Volcano scheduler before installing the chart:

  1. Install Volcano scheduler:

    kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.9.0/installer/volcano-development.yaml
    
  2. Install the NeMo platform Helm chart:

    helm --namespace default install \
       nemo nmp/nemo-microservices-helm-chart \
       --set tags.platform=false \
       --set tags.customizer=true
    

For GCP deployments, you must configure additional settings after installing Volcano:

  1. Install Volcano scheduler:

    kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.9.0/installer/volcano-development.yaml
    
  2. Install ResourceQuota for critical pods:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: critical-pods
      namespace: volcano-system
    spec:
      hard:
        pods: "100"
      scopeSelector:
        matchExpressions:
          - operator: In
            scopeName: PriorityClass
            values:
              - system-node-critical
              - system-cluster-critical
    EOF
    
  3. Wait for Volcano to create the “default” queue (5-10 minutes):

    kubectl get queue-v1beta1
    
Customize Volcano Queue#

In your custom values file for the NeMo Microservices Helm Chart, you can configure a Volcano queue for NeMo Customizer training jobs. The queue must have gpu and mlnxnics capabilities to schedule training jobs.

Tip

For more information about the Volcano queue, refer to Queue in the Volcano documentation.

The NeMo Microservices Helm Chart has default values for setting up a default Volcano queue. Set up the Volcano configuration values as follows:

  • If you want to use the default queue pre-configured in the chart, set volcano.enabled to true and keep customizer.customizerConfig.training.queue set to "default".

  • If you want to use your own Volcano queue, set volcano.enabled to false and specify your queue name in customizer.customizerConfig.training.queue.
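If you bring your own queue, a minimal Queue manifest might look like the following sketch. The queue name and the capability resource names are assumptions based on the gpu and mlnxnics capabilities noted above; adjust them to match the resources your nodes actually advertise:

```yaml
# Sketch of a custom Volcano queue for NeMo Customizer training jobs
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: customizer-training   # hypothetical queue name
spec:
  weight: 1
  reclaimable: false
  capability:
    nvidia.com/gpu: 8         # adjust to your per-node GPU count
    nvidia.com/mlnxnics: 8    # adjust to your cluster's NIC resource name and count
```

With such a queue in place, set volcano.enabled to false and customizer.customizerConfig.training.queue to customizer-training.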

Run:AI#

Alternatively, you can use Run:AI as the queue and executor for NeMo Customizer.

To configure NeMo Customizer to use the Run:AI executor, add the required override values to a custom values file such as customizer.runai.override.values.yaml. This applies when you install the NeMo Microservices Helm Chart. Adapt your custom values files accordingly if you install the microservices individually.

Weights & Biases in Run:AI#

If you configure Weights & Biases, update the following entries with your keys in the customizer.runai.override.values.yaml file:

customizer:
  customizerConfig:
    training:
      container_defaults:
        env:
          - name: WANDB_API_KEY
            value: 'xxx'
          - name: WANDB_ENCRYPTED_API_KEY
            value: 'xxx'
          - name: WANDB_ENCRYPTION_KEY
            value: 'xxx'

Note

For configuring Weights & Biases while using Volcano, refer to the Metrics tutorial.

MLflow#

You can configure NeMo Customizer to use MLflow to monitor training jobs. You need to deploy MLflow and set up the connection with the NeMo Customizer microservice.

  1. Create an mlflow.values.yaml file.

    postgresql:
      enabled: true
      auth:
        username: "bn_mlflow"
        password: "bn_mlflow"
    
    tracking:
      enabled: true
      auth:
        enabled: false
      runUpgradeDB: false
      service:
        type: ClusterIP
      resourcesPreset: medium
    
    run:
      enabled: false
    
  2. Install MLflow using Helm.

    helm install -n mlflow-system --create-namespace mlflow oci://registry-1.docker.io/bitnamicharts/mlflow --version 1.0.6 -f mlflow.values.yaml
    
  3. Integrate NeMo Customizer with MLflow by setting customizerConfig.mlflowURL in values.yaml.

    customizerConfig:
      # mlflowURL is the internal K8s DNS record for the mlflow service.
      # Example: "http://mlflow-tracking.mlflow-system.svc.cluster.local:80"
      mlflowURL: ""
    
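The internal DNS record follows the standard Kubernetes Service naming pattern, http://<service>.<namespace>.svc.cluster.local:<port>. Using the service and namespace names from the MLflow install above, a quick sketch to compose the URL:

```shell
# Compose the in-cluster URL for the MLflow tracking service.
# The service name and namespace match the Helm release installed above.
SERVICE="mlflow-tracking"
NAMESPACE="mlflow-system"
PORT=80
MLFLOW_URL="http://${SERVICE}.${NAMESPACE}.svc.cluster.local:${PORT}"
echo "${MLFLOW_URL}"
```

Set the resulting value as customizerConfig.mlflowURL in your values file.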

WandB#

You can customize the WandB configuration for NeMo Customizer to log data under a specific team or project as follows.

customizerConfig:
  # -- Weights & Biases (WandB) Python SDK initialization configuration for logging and monitoring training jobs in WandB.
  wandb:
    # -- The username or team name under which the runs will be logged.
    # -- If not specified, the run will default to the default entity set in the account settings.
    # -- To change the default entity, go to the account settings https://wandb.ai/settings
    # -- and update the "Default location to create new projects" under "Default team".
    # -- Reference: https://docs.wandb.ai/ref/python/init/
    entity: null
    # The name of the project under which this run will be logged.
    project: "nvidia-nemo-customizer"
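For example, to log runs under a specific team and project, you could set values like the following (both names here are placeholders):

```yaml
customizerConfig:
  wandb:
    # Placeholder team and project names -- replace with your own
    entity: "my-team"
    project: "my-customizer-experiments"
```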