NeMo Customizer Microservice Deployment Guide#
NeMo Customizer is a lightweight API server that runs managed training jobs on GPU nodes using the Volcano scheduler.
Prerequisites#
Before installing NeMo Customizer, make sure that you have all of the following:
Minimum System Requirements
A single-node Kubernetes cluster on a Linux host and cluster-admin level permissions.
At least 200 GB of free disk space.
At least one dedicated GPU (A100 80 GB or H100 80 GB).
Storage
Access to an external PostgreSQL Database to store model customization objects.
Access to an NFS-backed Persistent Volume that supports the ReadWriteMany access mode to enable fast checkpointing and minimize network traffic.
Kubernetes
A dedicated namespace.
Secrets assigned to that namespace for all of the following:
NGC Image pull credentials: Required to download the images.
Database credentials: Required for production database protection.
An available StorageClass for the NeMo Data Store.
NVIDIA Network Operator: Simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster.
Reviewed Tenant Configuration Options.
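The namespace, secret, and storage prerequisites above can be sketched as plain Kubernetes manifests. This is a minimal illustration under stated assumptions, not the layout the chart requires: the namespace nemo-customizer matches the Helm command used later in this guide, but the secret names, the secret keys, the claim name, the storage class name, and the size are hypothetical placeholders. Use the names that your values file actually references.

```yaml
# Dedicated namespace for NeMo Customizer.
apiVersion: v1
kind: Namespace
metadata:
  name: nemo-customizer
---
# NGC image pull credentials (secret name is hypothetical).
apiVersion: v1
kind: Secret
metadata:
  name: ngc-image-pull
  namespace: nemo-customizer
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-docker-config>
---
# Database credentials (secret name and keys are hypothetical).
apiVersion: v1
kind: Secret
metadata:
  name: customizer-db-credentials
  namespace: nemo-customizer
type: Opaque
stringData:
  username: <db-user>
  password: <db-password>
---
# Claim against an NFS-backed StorageClass that supports ReadWriteMany
# (claim name, class name, and size are hypothetical).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: customizer-workspace
  namespace: nemo-customizer
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: <requested-size>
```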
Values Setup for Installing NeMo Customizer#
If you want to install NeMo Customizer as a standalone microservice, configure the following value overrides in the values.yaml file.
customizer:
  enabled: true
data-store:
  enabled: true
entity-store:
  enabled: true
nemo-operator:
  enabled: true
evaluator:
  enabled: false
guardrails:
  enabled: false
deployment-management:
  enabled: false
nim-operator:
  enabled: false
nim-proxy:
  enabled: false
Install DGX Cloud Admission Controller#
DGX Cloud Admission Controller is required for configuring cluster networking for DGX Cloud, Elastic Kubernetes Service (EKS) on AWS, Azure Kubernetes Service (AKS) and Google Kubernetes Engine (GKE).
In the values.yaml file, set the dgxc-admission-controller.enabled value to true.
dgxc-admission-controller:
  enabled: true
Values Setup for Multi-node Training on AWS#
To set up NeMo Customizer for multi-node training on AWS, you need to configure the following value overrides.
To Configure with AWS EKS and EFA#
Define your initial values.yaml file.
Download overrides.values.yaml:
postgresql:
  primary:
    # Disable huge_pages to run on nodes with hugepages. Otherwise postgres errors with `Bus error ...`
    extendedConfiguration: |-
      huge_pages = off
    initdb:
      args: "--set huge_pages=off"
Install NeMo Customizer with the overrides.values.yaml file:
helm --namespace nemo-customizer install nemo-customizer \
  nemo-microservices-helm-chart \
  -f <path-to-your-values-file> \
  -f aws-overrides.values.yaml
Note
Pass the overrides.values.yaml file last so that it takes precedence over the other values file.
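Helm merges values files from left to right, so a key set in a later -f file replaces the same key set in an earlier one. The following sketch illustrates that behavior. The content of the first file is hypothetical and is there only to show a conflicting key.

```yaml
# File passed first: <path-to-your-values-file> (hypothetical content).
postgresql:
  primary:
    extendedConfiguration: |-
      huge_pages = try
---
# File passed last: the overrides file from the previous step.
postgresql:
  primary:
    extendedConfiguration: |-
      huge_pages = off
---
# Effective value after Helm merges both files: the last file wins.
postgresql:
  primary:
    extendedConfiguration: |-
      huge_pages = off
```

If the order of the two -f flags were reversed, the hypothetical huge_pages = try setting would win and the override would have no effect.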
Configure Features#
NeMo Customizer utilizes several services that you can deploy independently or test with default subcharts and values.
Queue Executor#
You have two options for the queue executor: Volcano and Run:AI.
Volcano#
Install Volcano#
The demo-values.yaml file has the following section:
volcano:
  enabled: true
This value installs the Volcano controller pods into the same namespace where the NeMo Customizer training jobs run. To support multi-node training, install Volcano in a separate namespace from the one where you run NeMo Customizer. For more information, see Volcano’s documentation.
Customize Volcano Queue#
In your custom values file for the NeMo Microservices Helm Chart, you can configure a Volcano queue for NeMo Customizer training jobs. The queue must have gpu and mlnxnics capabilities to schedule training jobs.
Tip
For more information about the Volcano queue, refer to Queue in the Volcano documentation.
The NeMo Microservices Helm Chart has default values for setting up a default Volcano queue. Set up the Volcano configuration values as follows:
If you want to use the default queue pre-configured in the chart, set volcano.enabled to true and keep customizer.customizerConfig.training.queue set to "default".
If you want to use your own Volcano queue, set volcano.enabled to false and set customizer.customizerConfig.training.queue to the name of your Volcano queue.
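The second option can be sketched as follows. The first document is a values fragment built from the keys named above. The second is a standalone Volcano Queue manifest that you would apply separately. The queue name customizer-training, the resource names nvidia.com/gpu and nvidia.com/mlnxnics, and the quantities are assumptions for illustration. Check the allocatable resources on your nodes for the names that your cluster actually exposes.

```yaml
# Custom values file: skip the chart's default queue and point
# NeMo Customizer at your own queue (queue name is hypothetical).
volcano:
  enabled: false
customizer:
  customizerConfig:
    training:
      queue: "customizer-training"
---
# Separate manifest: a Volcano queue with GPU and NIC capabilities.
# Resource names and quantities are assumptions; adjust to your cluster.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: customizer-training
spec:
  weight: 1
  reclaimable: false
  capability:
    nvidia.com/gpu: 8
    nvidia.com/mlnxnics: 8
```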
Run:AI#
Alternatively, you can use Run:AI as the queue and executor for NeMo Customizer.
To configure NeMo Customizer to use the Run:AI executor, add the following manifest snippet to your custom values file: customizer.runai.override.values.yaml.
This sample manifest is for cases where you use the NeMo Microservices Helm Chart.
Adapt your custom values files accordingly if you want to install the microservices individually.
Weights & Biases in Run:AI#
If you configure Weights & Biases, update the following values with your keys in the customizer.runai.override.values.yaml file:
customizer:
  customizerConfig:
    training:
      container_defaults:
        env:
          - name: WANDB_API_KEY
            value: 'xxx'
          - name: WANDB_ENCRYPTED_API_KEY
            value: 'xxx'
          - name: WANDB_ENCRYPTION_KEY
            value: 'xxx'
Note
To configure Weights & Biases while using Volcano, refer to the Metrics tutorial.
MLflow#
You can configure NeMo Customizer to use MLflow to monitor training jobs. You need to deploy MLflow and set up the connection with the NeMo Customizer microservice.
Create a mlflow.values.yaml file.
postgresql:
  enabled: true
  auth:
    username: "bn_mlflow"
    password: "bn_mlflow"
tracking:
  enabled: true
  auth:
    enabled: false
  runUpgradeDB: false
  service:
    type: ClusterIP
  resourcesPreset: medium
run:
  enabled: false
Install MLflow using helm.
helm install -n mlflow-system --create-namespace mlflow oci://registry-1.docker.io/bitnamicharts/mlflow --version 1.0.6 -f mlflow.values.yaml
Integrate NeMo Customizer with MLflow by setting customizerConfig.mlflowURL in values.yaml.
customizerConfig:
  # mlflowURL is the internal K8s DNS record for the mlflow service.
  # Example: "http://mlflow-tracking.mlflow-system.svc.cluster.local:80"
  mlflowURL: ""
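With the MLflow installation from the previous steps (release mlflow in the mlflow-system namespace), the filled-in value would likely look like the sketch below. It reuses the example URL from the comment above and assumes that the chart names the tracking service mlflow-tracking and exposes it on port 80. You can confirm the service name and port with kubectl get svc -n mlflow-system before you set the value.

```yaml
customizerConfig:
  # Assumed service name and port; verify them in the mlflow-system namespace.
  mlflowURL: "http://mlflow-tracking.mlflow-system.svc.cluster.local:80"
```

If you install through the NeMo Microservices Helm Chart, this block would sit under the top-level customizer key, in the same way as the customizer.customizerConfig.training.queue setting.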
WandB#
You can customize the WandB configuration for NeMo Customizer to log data under a specific team or project as follows.
customizerConfig:
  # -- Weights and Biases (WandB) Python SDK initialization configuration for logging and monitoring training jobs in WandB.
  wandb:
    # -- The username or team name under which the runs will be logged.
    # -- If not specified, the run will default to a default entity set in the account settings.
    # -- To change the default entity, go to the account settings https://wandb.ai/settings
    # -- and update the “Default location to create new projects” under “Default team”.
    # -- Reference: https://docs.wandb.ai/ref/python/init/
    entity: null
    # The name of the project under which this run will be logged
    project: "nvidia-nemo-customizer"