Run NeMo Framework on Kubernetes

NeMo Framework supports DGX A100 and H100-based Kubernetes (k8s) clusters with attached NFS storage and compute networking. Currently, the k8s support covers only GPT-based models for the data preparation, base model pre-training, model conversion, and evaluation stages, with more stages and models to come soon. This document is intended as a quick-start guide to set up a cluster, prepare a dataset, and pre-train a base model with NeMo Framework on k8s clusters.

This document assumes a Kubernetes cluster has been provisioned with at least 2x DGX A100 or DGX H100 systems as worker nodes, with both the GPU Operator and Network Operator configured so that k8s advertises 8x GPU resources and one or more high-speed compute network links (e.g., 4x InfiniBand links) per worker node as allocatable in jobs.
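
As an optional check, you can confirm that the Operators expose these resources by inspecting a worker node's allocatable resources. The exact resource names (for example nvidia.com/gpu or nvidia.com/hostdev) depend on your Operator configuration:

kubectl describe node <worker node name> | grep -A 10 Allocatable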

Additionally, the component versions in the support matrix have been tested. All of the listed Operators are required in addition to Helm, and the Operators can be installed with Helm.

As for the hardware, NFS is required as the storage backend, and it needs to be mounted on all nodes in the cluster at the same path. Additionally, the head node where jobs will be launched requires that both helm and kubectl be accessible to the user. Lastly, the head node needs to be able to install Python dependencies via pip, either in a virtual environment or on bare metal.
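
As a quick, optional sanity check from the head node (the virtual environment name below is arbitrary):

kubectl version --client
helm version
python3 -m venv nemo-venv
source nemo-venv/bin/activate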

The following sections walk through setup, data preparation, and training a base GPT model. The steps assume you are running from the head node of the k8s cluster as your preferred user, who has been granted access to the cluster. The k8s cluster and all relevant Operators should already be installed and configured at this point.

Setup

First, a secret needs to be created on the k8s cluster to authenticate with the NGC private registry. If not done already, get an NGC key from ngc.nvidia.com. Then create the secret on the k8s cluster with the following command, replacing <NGC KEY HERE> with your NGC key. Note that if your key contains any special characters, you may need to wrap it in single quotes (') so it is parsed correctly by k8s:

kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>

Next, clone the k8s support branch from GitHub with:

git clone https://github.com/NVIDIA/nemo-megatron-launcher

The default config files need to be updated to reflect your cluster. Navigate to the launcher_scripts directory with:

cd nemo-megatron-launcher/launcher_scripts

Open the conf/cluster/k8s.yaml config file with a text editor and update the default values to reflect your settings. The parameters are explained below, and an example configuration follows the list:

  • pull_secret: This is the name of the secret key created earlier. In the example above, the name was ngc-registry.

  • shm_size: This is the amount of shared memory to allocate in each Pod. The value should end in “Gi” for gibibytes.

  • nfs_server: This is the hostname or IP address of the NFS server that stores the data and is mounted on all nodes.

  • nfs_path: This is the path that should be mounted from the NFS server into Pods.

  • ib_resource_name: This is the resource name advertised to k8s for the compute network devices, such as nvidia.com/hostdev.

  • ib_count: The number of compute links per node, such as “4”.
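
As an illustration, a filled-in conf/cluster/k8s.yaml might look like the following sketch. All of the values shown (the secret name, shared memory size, NFS address and path, resource name, and link count) are placeholders for this example and must be replaced with your own settings:

pull_secret: ngc-registry
shm_size: 512Gi
nfs_server: nfs.example.com
nfs_path: /mnt/nfs/nemo
ib_resource_name: nvidia.com/hostdev
ib_count: "4"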

After updating the cluster configuration, install the required Python packages with:

pip install -r ../requirements.txt

Data Preparation

Once the repository has been cloned and the cluster configuration is set, the configuration files can be modified to launch the data preparation stage for The Pile dataset. First, a few config files need to be updated. From the launcher_scripts directory, open conf/config.yaml in a text editor. The following settings need to be modified (an example snippet follows the list):

  • cluster: This needs to be “k8s”.

  • data_preparation: This should be “gpt3/download_gpt3_pile”.

  • stages: This needs to be just “data_preparation” and no other stage.

  • cluster_type: This needs to be “k8s”. Note that this is different from the “cluster” parameter above.

  • launcher_scripts_path: This is the path to the cloned nemo-megatron-launcher repository. This needs to end in launcher_scripts.

  • container: Replace this with the latest version of the training container available on NGC, such as “nvcr.io/ea-bignlp/ga-participants/nemofw-training:<TAG>”. Replace <TAG> with the latest container tag available.
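
For reference, the relevant fields in conf/config.yaml for this stage might look like the sketch below. The exact layout of the file may differ between releases; the launcher_scripts_path and container tag are placeholders that must match your environment:

defaults:
  - cluster: k8s
  - data_preparation: gpt3/download_gpt3_pile

stages:
  - data_preparation

cluster_type: k8s
launcher_scripts_path: /path/to/nemo-megatron-launcher/launcher_scripts
container: nvcr.io/ea-bignlp/ga-participants/nemofw-training:<TAG>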

Next, open conf/data_preparation/gpt3/download_gpt3_pile.yaml in a text editor. Update the node_array_size to the number of worker nodes in the cluster. The higher the worker count, the faster data preparation will finish.
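
For example, on a cluster with four worker nodes (4 is illustrative):

node_array_size: 4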

Once all config files have been updated, it is time to launch the data preparation job. To do so, run python3 main.py. This will create a Helm chart at results/download_gpt3_pile/preprocess/k8s_template. The Helm chart will be launched automatically by the launcher script. This can be verified with helm list which will show the data preparation job.
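
For example, from the launcher_scripts directory:

python3 main.py
helm list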

After the chart is installed and once the worker nodes are available, the job will be launched on the number of requested worker nodes. This can be verified with:

$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
nlp-data-prep-launcher-zbv7p   1/1     Running   0          6s
nlp-data-prep-worker-0         1/1     Running   0          6s
nlp-data-prep-worker-1         1/1     Running   0          6s
nlp-data-prep-worker-2         1/1     Running   0          6s
nlp-data-prep-worker-3         1/1     Running   0          6s

In the scenario above, 4 worker nodes are being used for data preparation. Progress can be monitored by following the logs of the launcher Pod. The following command shows its output (the five-character suffix in the Pod name is different for every run, so replace it with the name from your kubectl get pods output above):

kubectl logs --follow nlp-data-prep-launcher-zbv7p

Depending on the number of nodes that were requested, this will take a few hours to download all 30 shards of The Pile dataset, extract each shard, then pre-process the extracted files.

In addition to the logs, the data directory can be monitored manually by running ls or du -sh against it (the default location is the launcher_scripts/data path unless otherwise altered). During the data prep process, several .jsonl.zst files are downloaded, then extracted to .jsonl files, and finally converted to .bin and .idx files. At the end, there should be 30 .bin and 30 .idx files - one for each shard.
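
For example, from the launcher_scripts directory, assuming the default data location:

du -sh data/
ls data/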

Once the dataset has been fully preprocessed, you can remove the helm chart with:

helm uninstall download-gpt3-pile

This will free up the worker nodes.

Model Training

After preparing the dataset, a base model can be trained. First, open conf/config.yaml in a text editor. Update the stages to be just “training”. For this example, the 5B model will be used. Update the training parameter in the defaults section at the top of the file to “gpt3/5b”.
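
A minimal sketch of the relevant lines in conf/config.yaml for this stage is shown below. The exact layout of your file may differ, and the other settings from the data preparation step stay the same:

defaults:
  - training: gpt3/5b

stages:
  - training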

Next, open conf/training/gpt3/5b.yaml in a text editor. Update num_nodes to the number of nodes to run on for your cluster. The remaining settings can be kept with their default values.
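
For example, for a four-node run (the node count is illustrative and must match your cluster):

num_nodes: 4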

To launch the training job, run:

python3 main.py

This will create a Helm chart at results/gpt3_5b/k8s_template. The chart will be installed automatically and run by the launcher script.

When resources are available on the cluster, the job will launch as many worker Pods as the number of nodes that were requested. This can be verified with:

$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
nlp-training-worker-0   1/1     Running   0          26h
nlp-training-worker-1   1/1     Running   0          26h
nlp-training-worker-2   1/1     Running   0          26h
nlp-training-worker-3   1/1     Running   0          26h

After a few minutes, once the NeMo library has finished initializing, the training job will begin on the cluster.

To monitor progress, follow the training logs of the first Pod with:

kubectl logs --follow nlp-training-worker-0

Depending on the number of nodes that were requested, it may take a few days for the model to finish training. By default, every 2000 global steps, a checkpoint will be generated and saved at results/gpt3_5b/results/checkpoints.

The training can either be run to completion or terminated early once the desired number of steps have been run. Once ready, the job can be cleaned up by running:

helm uninstall gpt3-5b

This will free up the worker nodes for future jobs.

With the base dataset pre-processed in the workspace and pre-training completed for the base model, any additional fine-tuning and deployment steps can be done by following the common flow below:

  1. Update the cluster config file at conf/cluster/k8s.yaml if necessary.

  2. Update the main config file at conf/config.yaml and change both cluster and cluster_type to k8s. Update the launcher_scripts_path to point to the NeMo-Megatron-Launcher repository hosted on the NFS. Also update the stages as necessary.

  3. Run python3 main.py to launch the job. This will create a Helm chart named k8s_template in the results directory for the job, and the chart will be launched automatically.

  4. Verify the job was launched with helm list and kubectl get pods.

  5. View job logs by reading the logs from the first pod with kubectl logs --follow <first pod name>.

  6. Once the job finishes, clean up the job by running helm uninstall <job name>.

In some cases the deployment will not run as intended due to an incorrect config. To resolve the issue, any running jobs need to be stopped by first removing the Helm chart. This can be done by running helm list which will show all deployed Helm charts on the cluster. Find the chart causing issues and stop it with helm uninstall <chart name>.

After stopping the chart, any Pods that were running will be spun down by the cluster automatically. Verify this by running kubectl get pods, which will list all Pods in the default namespace. Depending on the job that was running, it could take a few minutes for some Pods to terminate. If they don’t terminate automatically, they can be manually removed with kubectl delete pod <pod name>.

Advanced: For those familiar with Helm charts, the charts can be manually edited in the results directory for the job. Each job will have a directory under the model name and the Helm chart is located at k8s_template. The files can be manually updated as needed and the job can be launched with helm install <job name> results/<path to specific job>/k8s_template.
