Running NeMo Framework on Kubernetes
NeMo Framework supports DGX A100 and H100-based Kubernetes (k8s) clusters with attached NFS storage and compute networking. Currently, the k8s support covers only GPT-based models for the data preparation, base model pre-training, model conversion, and evaluation stages, with more stages and models to come soon. This document is intended as a quick-start guide to set up a cluster, prepare a dataset, and pre-train a base model with NeMo FW on k8s clusters.
This document assumes a Kubernetes cluster has been provisioned with at least two DGX A100 or DGX H100 systems as worker nodes, with both the GPU and Network Operators configured so that k8s labels each worker with 8x GPU resources and one or more high-speed compute network links per node (e.g. 4x InfiniBand links) as allocatable in jobs.
Additionally, the following support matrix has been tested. All Operators listed below are required in addition to Helm.
Cluster Management Software: Bright Cluster Manager v10.0
As for the hardware, NFS is required for the storage backend, and it needs to be mounted on all nodes in the cluster at the same path. Additionally, the head node where jobs will be launched requires that both helm and kubectl be accessible by the user. Lastly, the head node needs to be able to install Python dependencies via pip, either in a virtual environment or on bare metal.
The following sections walk through setup, data preparation, and training a base GPT model. The steps assume you are running from the head node of the k8s cluster as a user who has been added to the cluster. The k8s cluster and all relevant Operators should already be installed and configured at this point.
First, a secret key needs to be created on the k8s cluster to authenticate with the NGC private registry. If not done already, get an NGC key from ngc.nvidia.com. Create a secret on the k8s cluster with the command below, replacing <NGC KEY HERE> with your NGC secret key. Note that if your key contains any special characters, you may need to wrap it in single quotes (') so it can be parsed correctly by k8s:
kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>
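The single-quote advice matters because an unquoted key containing characters like $ or ! would be altered by the shell before kubectl ever sees it. A minimal, self-contained illustration (the key value here is made up):

```shell
# A made-up key value containing characters the shell would otherwise
# treat specially ($ triggers variable expansion, ! can trigger history
# expansion in interactive shells). Single quotes preserve it verbatim.
key='abc$def!ghi'

# Double quotes around "$key" keep the stored value intact when it is
# passed on to another command.
printf '%s\n' "$key"   # prints: abc$def!ghi
```

The same principle applies to the --docker-password argument above: with single quotes around the key, the secret stored in k8s matches the key exactly.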
Next, clone the k8s support branch from GitHub with:
git clone https://github.com/NVIDIA/nemo-megatron-launcher
The default config files need to be updated to reflect your cluster. Navigate to the launcher_scripts directory inside the cloned repository, then open the conf/cluster/k8s.yaml config file with a text editor and update the default values to reflect your settings. The parameters are explained below:
pull_secret: This is the name of the secret key created earlier. In the example above, the name was ngc-registry.
shm_size: This is the amount of system memory to allocate in Pods. The value should end in “Gi” for gibibytes.
nfs_server: This is either the hostname or IP address for the NFS server where data is stored on all nodes.
nfs_path: This is the path that should be mounted from the NFS server into Pods.
ib_resource_name: This is the resource name assigned by k8s that is associated with the compute network links.
ib_count: The number of compute links per node, such as “4”.
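Putting the parameters together, conf/cluster/k8s.yaml might look like the sketch below. Every value is a placeholder for illustration; substitute the real values for your cluster, and note in particular that the ib_resource_name shown is an assumption, not a universal default:

```yaml
# Example conf/cluster/k8s.yaml -- all values below are placeholders.
pull_secret: ngc-registry              # name of the secret created with kubectl earlier
shm_size: 512Gi                        # shared memory allocated to each Pod
nfs_server: 10.0.0.10                  # hostname or IP of the NFS server
nfs_path: /export/nemo                 # NFS export mounted into Pods
ib_resource_name: nvidia.com/hostdev   # placeholder; use your cluster's resource name
ib_count: "4"                          # high-speed compute links per node
```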
After updating the cluster configuration, install the required Python packages with:
pip install -r ../requirements.txt
Once the repository has been cloned and the cluster configuration is set, the configuration files can be modified to launch the data preparation stage for The Pile dataset. First, a few config files need to be updated. Back at the
launcher_scripts directory from before, open
conf/config.yaml in a text editor. The following settings need to be modified:
cluster: This needs to be “k8s”.
data_preparation: This should be “gpt3/download_gpt3_pile”.
stages: This needs to be just “data_preparation” and no other stage.
cluster_type: This needs to be “k8s”. Note that this is different from the “cluster” parameter above.
launcher_scripts_path: This is the path to the cloned nemo-megatron-launcher repository. This needs to end in launcher_scripts.
container: Replace this with the latest version of the training container available on NGC, such as “nvcr.io/ea-bignlp/ga-participants/nemofw-training:<TAG>”. Replace <TAG> with the latest container tag available.
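As a sketch, the relevant portion of conf/config.yaml for this stage could look like the excerpt below. The layout assumes the file's existing Hydra defaults list, and the launcher_scripts_path shown is a placeholder:

```yaml
# Excerpt of conf/config.yaml -- only the keys relevant to data preparation.
defaults:
  - cluster: k8s
  - data_preparation: gpt3/download_gpt3_pile
stages:
  - data_preparation
cluster_type: k8s
launcher_scripts_path: /nfs/nemo-megatron-launcher/launcher_scripts  # placeholder path
container: nvcr.io/ea-bignlp/ga-participants/nemofw-training:<TAG>   # use the latest tag
```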
Next, open conf/data_preparation/gpt3/download_gpt3_pile.yaml in a text editor. Update node_array_size to the number of worker nodes in the cluster. The higher the worker count, the faster data preparation will finish.
Once all config files have been updated, it is time to launch the data preparation job. To do so, run
python3 main.py. This will create a Helm chart at
results/download_gpt3_pile/preprocess/k8s_template. The Helm chart will be launched automatically by the launcher script. This can be verified with
helm list which will show the data preparation job.
After the chart is installed and once the worker nodes are available, the job will be launched on the number of requested worker nodes. This can be verified with:
$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
nlp-data-prep-launcher-zbv7p   1/1     Running   0          6s
nlp-data-prep-worker-0         1/1     Running   0          6s
nlp-data-prep-worker-1         1/1     Running   0          6s
nlp-data-prep-worker-2         1/1     Running   0          6s
nlp-data-prep-worker-3         1/1     Running   0          6s
In the scenario above, 4 worker nodes are being used for data preparation. Progress can be monitored by following the logs of the first Pod. The following command will show the output of the first Pod (the final 5 letters in the Pod name will be different for every run - replace with the output from your command above):
kubectl logs --follow nlp-data-prep-launcher-zbv7p
Depending on the number of nodes that were requested, this will take a few hours to download all 30 shards of The Pile dataset, extract each shard, then pre-process the extracted files.
In addition to the logs, the data directory can be monitored manually. Running du -sh against the data directory (default location is launcher_scripts/data unless otherwise altered) shows how much data has been processed so far. During the data prep process, several .jsonl.zst files will be downloaded, then extracted to .jsonl files, then finally converted to .bin and .idx files. At the end, there should be 30 .bin and 30 .idx files - one for each shard.
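As a convenience, a small shell helper along these lines can count the shard files. Note that count_shards is a hypothetical helper written for this guide, not part of the launcher, and the default data path is an assumption:

```shell
# Hypothetical helper: count .bin and .idx shard files in a directory.
count_shards() {
  local dir="$1" bin=0 idx=0 f
  # Iterate over glob matches; the -e test skips the literal pattern
  # when no files match.
  for f in "$dir"/*.bin; do [ -e "$f" ] && bin=$((bin+1)); done
  for f in "$dir"/*.idx; do [ -e "$f" ] && idx=$((idx+1)); done
  printf 'bin=%d idx=%d\n' "$bin" "$idx"
}

# Usage (expects 30 of each once preprocessing is complete):
count_shards launcher_scripts/data
```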
Once the dataset has been fully preprocessed, you can remove the helm chart with:
helm uninstall download-gpt3-pile
This will free up the worker nodes.
After preparing the dataset, a base model can be trained. First, open
conf/config.yaml in a text editor. Update the
stages to be just “training”. For this example, the 5B model will be used. Update the
training parameter in the
defaults section at the top of the file to “gpt3/5b”.
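Sketched as a conf/config.yaml excerpt (assuming the same defaults-list layout used in the data preparation stage):

```yaml
# Excerpt of conf/config.yaml -- select the training stage and the 5B model.
defaults:
  - training: gpt3/5b
stages:
  - training
```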
Then open conf/training/gpt3/5b.yaml in a text editor. Update num_nodes to the number of nodes to run on for your cluster. The remaining settings can be kept at their default values.
To launch the training job, run python3 main.py. This will create a Helm chart at
results/gpt3_5b/k8s_template. The chart will be installed automatically and run by the launcher script.
When resources are available on the cluster, it will launch as many worker Pods as the number of nodes that were requested. This can be verified with:
$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
nlp-training-worker-0   1/1     Running   0          26h
nlp-training-worker-1   1/1     Running   0          26h
nlp-training-worker-2   1/1     Running   0          26h
nlp-training-worker-3   1/1     Running   0          26h
After a few minutes when the NeMo library is finished initializing, the training job will begin on the cluster.
To monitor progress, the training logs can be followed by looking at the logs of the first Pod with:
kubectl logs --follow nlp-training-worker-0
Depending on the number of nodes that were requested, it may take a few days for the model to finish training. By default, a checkpoint will be generated and saved every 2000 global steps in the job's results directory.
The training can either be run to completion or terminated early once the desired number of steps have been run. Once ready, the job can be cleaned up by running:
helm uninstall gpt3-5b
This will free up the worker nodes for future jobs.
With the base dataset pre-processed in the workspace and pre-training completed for the base model, any additional fine-tuning and deployment steps can be done by following the common flow below:
Update the cluster config file at conf/cluster/k8s.yaml as needed.
Update the main config file at conf/config.yaml and change both cluster and cluster_type to k8s. Update the launcher_scripts_path to point to the NeMo-Megatron-Launcher repository hosted on the NFS. Also update the stages as necessary.
Run python3 main.py to launch the job. This will create a Helm chart in the results directory for the job, named k8s_template, which will be launched automatically.
Verify the job was launched with
kubectl get pods.
View job logs by reading the logs from the first pod with
kubectl logs --follow <first pod name>.
Once the job finishes, clean up the job by running
helm uninstall <job name>.
In some cases the deployment will not run as intended due to an incorrect config. To resolve the issue, any running jobs need to be stopped by first removing the Helm chart. This can be done by running
helm list which will show all deployed Helm charts on the cluster. Find the chart causing issues and stop it with
helm uninstall <chart name>.
After stopping the chart, any Pods that were running will be spun down by the cluster automatically. Verify this by running
kubectl get pods which will list all Pods in the default namespace. Depending on the job that was running, it could take a few minutes for some Pods to terminate. If they don’t terminate automatically, they can be manually removed with
kubectl delete pod <pod name>.
Advanced: For those familiar with Helm charts, the charts can be manually edited in the
results directory for the job. Each job will have a directory under the model name and the Helm chart is located at
k8s_template. The files can be manually updated as needed and the job can be launched with
helm install <job name> results/<path to specific job>/k8s_template.