NeMo Customizer Microservice Deployment Guide#

NeMo Customizer is a lightweight API server that runs managed training jobs on GPU nodes using the Volcano scheduler.

Prerequisites#

Before installing NeMo Customizer, make sure that you have all of the following:

Minimum System Requirements

  • A single-node Kubernetes cluster on a Linux host, with cluster-admin permissions.

  • At least 200 GB of free disk space.

  • At least one dedicated GPU (A100 80 GB or H100 80 GB).

Microservices

NeMo Customizer requires the following NeMo microservices to be installed:


Values Setup for Multi-node Training on AWS#

AWS EKS and EFA#

  1. Define your initial values.yaml file.

  2. Download overrides.values.yaml:

    postgresql:
      primary:
        # Disable huge_pages so Postgres can run on nodes with hugepages enabled. Otherwise Postgres fails with `Bus error ...`
        extendedConfiguration: |-
          huge_pages = off
        initdb:
          args: "--set huge_pages=off"
    
  3. Install NeMo Customizer with the overrides.values.yaml file:

    helm -n nemo-customizer \
      install customizer \
      nemo-customizer-<chart version>.tgz \
      -f <path-to-your-values-file> \
      -f overrides.values.yaml
    

Note

It is important to pass the overrides.values.yaml file last to give it precedence over the other values file.

Configure Features#

NeMo Customizer relies on several services that you can deploy independently or test with the default subcharts and values.

Queue Executor#

You have two options for the queue executor: Volcano and Run:AI.

Volcano#

Install Volcano#

Install Volcano in a separate namespace from where you run NeMo Customizer. For more information, see Volcano’s documentation.
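
As a minimal sketch, installing Volcano from its Helm chart into a dedicated volcano-system namespace might look like the following (verify the repository URL and chart options against Volcano's documentation):

    # Add the Volcano Helm repository and install the chart in its own namespace
    helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
    helm repo update
    helm install volcano volcano-sh/volcano -n volcano-system --create-namespace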

Note

Installing the Volcano controller pods in the same namespace as NeMo Customizer can cause inconsistent job monitoring behavior and breaks multi-node training support.

Customize Volcano Queue#

In your custom values file for the NeMo Microservices Helm Chart or NeMo Customizer Helm chart, you can configure a Volcano queue for NeMo Customizer training jobs. The queue must have gpu and mlnxnics capabilities to schedule training jobs.
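
For illustration, a sketch of such a queue follows; the name customizer-queue and the capability values are placeholders to adapt to your cluster:

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: customizer-queue   # placeholder queue name
    spec:
      weight: 1
      reclaimable: true
      capability:
        gpu: 8        # total GPUs the queue may schedule
        mlnxnics: 16  # NICs available for multi-node training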

Tip

For more information about the Volcano queue, refer to Queue in the Volcano documentation.

Configure in NeMo Microservices Helm Chart#

The NeMo Microservices Helm Chart ships with default values that set up a default Volcano queue. If you use this chart, configure the Volcano values as follows:

  • To use the default queue configured by NeMo Customizer, set volcano.enabled to true and keep customizer.customizerConfig.training.queue set to "default".

  • To use your own Volcano queue, set volcano.enabled to false and set customizer.customizerConfig.training.queue to your queue name, as shown in the sketch after this list.
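
For example, a minimal values sketch that points the NeMo Microservices Helm Chart at your own queue (my-queue is a hypothetical queue name):

    volcano:
      enabled: false   # skip the default queue created by the chart

    customizer:
      customizerConfig:
        training:
          queue: "my-queue"   # your existing Volcano queue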

Configure in NeMo Customizer Helm Chart#

If you install the NeMo Customizer Helm chart directly, set up Volcano in your cluster first and set customizerConfig.training.queue to your Volcano queue name.
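
A minimal sketch for the standalone chart, again with a hypothetical queue name:

    customizerConfig:
      training:
        queue: "my-queue"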

Run:AI#

Alternatively, you can use Run:AI as the queue and executor for NeMo Customizer.

To configure NeMo Customizer to use the Run:AI executor, add the manifest snippet from customizer.runai.override.values.yaml to your custom values file. This sample manifest applies when you use the NeMo Microservices Helm Chart; adapt your custom values files accordingly if you install the microservices individually.

Weights & Biases in Run:AI#

If you configure Weights & Biases, update the following values in the customizer.runai.override.values.yaml file with your own keys:

customizer:
  customizerConfig:
    training:
      container_defaults:
        env:
          - name: WANDB_API_KEY
            value: 'xxx'
          - name: WANDB_ENCRYPTED_API_KEY
            value: 'xxx'
          - name: WANDB_ENCRYPTION_KEY
            value: 'xxx'

Note

For configuring Weights & Biases while using Volcano, refer to the Metrics tutorial.

MLflow#

You can configure NeMo Customizer to use MLflow to monitor training jobs. You need to deploy MLflow and set up the connection with the NeMo Customizer microservice.

  1. Create an mlflow.values.yaml file:

    postgresql:
      enabled: true
      auth:
        username: "bn_mlflow"
        password: "bn_mlflow"
    
    tracking:
      enabled: true
      auth:
        enabled: false
      runUpgradeDB: false
      service:
        type: ClusterIP
      resourcesPreset: medium
    
    run:
      enabled: false
    
  2. Install MLflow using Helm:

    helm install -n mlflow-system \
      --create-namespace \
      mlflow \
      oci://registry-1.docker.io/bitnamicharts/mlflow \
      --version 1.0.6 \
      -f mlflow.values.yaml
    
  3. Integrate NeMo Customizer with MLflow by setting customizerConfig.mlflowURL in values.yaml:

    customizerConfig:
      # mlflowURL is the internal K8s DNS record for the mlflow service.
      # Example: "http://mlflow-tracking.mlflow-system.svc.cluster.local:80"
      mlflowURL: ""
    
  4. Upgrade NeMo Customizer using Helm:

    helm -n nemo-customizer \
      upgrade customizer \
      nemo-customizer-<chart version>.tgz \
      -f <path-to-your-values-file>
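
To confirm the integration, you can check that the MLflow tracking service referenced by mlflowURL is running; the resource names below follow from the install command in step 2 and the internal DNS example in step 3:

    kubectl get pods -n mlflow-system
    kubectl get svc mlflow-tracking -n mlflow-system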
    

WandB#

You can customize the WandB configuration for NeMo Customizer to log data under a specific team or project.

  1. Update the default WandB configuration of NeMo Customizer by setting customizerConfig.wandb in values.yaml:

    customizerConfig:
      # -- Weights and Biases (WandB) Python SDK initialization configuration
      # -- for logging and monitoring training jobs in WandB.
      wandb:
        # -- The username or team name under which the runs will be logged.
        # -- If not specified, the run will default to the default entity set in the account settings.
        # -- To change the default entity, go to the account settings https://wandb.ai/settings
        # -- and update the “Default location to create new projects” under “Default team”.
        # -- Reference: https://docs.wandb.ai/ref/python/init/
        entity: null
        # -- The name of the project under which this run will be logged.
        project: "nvidia-nemo-customizer"
    
  2. Upgrade NeMo Customizer using Helm:

    helm -n nemo-customizer \
      upgrade customizer \
      nemo-customizer-<chart version>.tgz \
      -f <path-to-your-values-file>
    

Models Config#

Configure GPU allocation and parallelism for model training in customizerConfig.models. The key parameters are:

  • num_gpus: GPUs per node (default: 1)

  • num_nodes: Number of nodes to use (default: 1)

  • tensor_parallel_size: Number of GPUs to split the model across (must evenly divide the total GPU count, num_gpus × num_nodes)

Tip

For best performance, fully utilize GPUs on a single node before scaling to multiple nodes.

Example configuration for distributed training:

customizerConfig:
  models:
    my-base-model:
      training_options:
      - training_type: sft
        finetuning_type: lora
        num_gpus: 8             # GPUs per node
        num_nodes: 2            # Use 2 nodes
        tensor_parallel_size: 8 # Split model across 8 GPUs

For more details on parallelism strategies, see how parallelism works.

Job Pod Tolerations#

To dedicate specific nodes to NeMo Customizer jobs using Kubernetes taints and tolerations, do the following:

  1. Drain existing workloads from the target node:

    kubectl get pods --all-namespaces --field-selector spec.nodeName=<TARGET_NODE>
    kubectl drain <TARGET_NODE> --ignore-daemonsets
    
  2. Add taint to the node:

    kubectl taint nodes <TARGET_NODE> app=customizer:NoSchedule
    
  3. Add matching tolerations to the job pod in values.yaml under the customizerConfig.tolerations key:

    customizerConfig:
      tolerations:
        - key: app
          value: customizer
          operator: Equal
          effect: NoSchedule
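
To verify the setup, you can confirm the taint on the node and check where subsequent job pods are scheduled (assuming jobs run in the nemo-customizer namespace used above):

    kubectl describe node <TARGET_NODE> | grep Taints
    kubectl get pods -n nemo-customizer -o wide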
    

For more details, see Kubernetes taints and tolerations.