NeMo Framework currently supports data preparation, base model training, conversion, and evaluation of GPT models on Kubernetes (k8s) clusters with additional models and stages to be included in the future.
At a minimum, a k8s cluster needs at least two worker nodes, such as DGX A100 or DGX H100 systems, and one controller node with `helm` and `kubectl` installed and the ability to launch Helm charts that run on the worker nodes. The implementation has been validated on a vanilla k8s environment, but it can be extended to support other k8s environments, such as managed k8s offerings from various CSPs.
Additionally, the cluster needs shared NFS storage mounted at the same location on all nodes, including both the workers and the controller(s). The NeMo-Megatron-Launcher repository should be copied to this NFS location so that it is accessible inside pods.
Lastly, the worker nodes should have at least one high-speed compute fabric, such as InfiniBand, to allow the worker nodes to communicate with one another.
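Assuming `kubectl` and `helm` are already available on the controller, a quick sanity check of these prerequisites might look like the following (the NFS path is an illustrative example, not a required location):

```shell
# Sanity-check the cluster prerequisites from the controller node.
kubectl get nodes        # expect at least two worker nodes in the Ready state
helm version --short     # confirm helm is installed

# The shared NFS mount must be at the same path on every node;
# the path below is illustrative.
ls /mnt/nfs/NeMo-Megatron-Launcher
```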
Operators
A few k8s operators need to be installed for the k8s stages to function. The following operators are required; the versions listed are those validated by NVIDIA, though these exact versions are not strictly required.
GPU Operator: 23.3.2
Network Operator: 23.1.0
KubeFlow Training Operator: 1.6.0
All of the operators above can be installed with `helm`.
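As a sketch, installing the GPU Operator and Network Operator from NVIDIA's Helm repository might look like the following. The release names, namespaces, and version flags are illustrative; the Training Operator line shows an alternative installation from the Kubeflow manifests rather than Helm.

```shell
# Add NVIDIA's Helm repository (release names and namespaces are illustrative).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# GPU Operator
helm install --wait gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace --version v23.3.2

# Network Operator
helm install --wait network-operator nvidia/network-operator \
    --namespace nvidia-network-operator --create-namespace --version v23.1.0

# KubeFlow Training Operator (alternative: install from the Kubeflow manifests)
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0"
```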
Typical Flow
Launching jobs on k8s clusters typically follows the same pattern:
1. Update the cluster config file at `conf/cluster/k8s.yaml` if necessary.
2. Update the main config file at `conf/config.yaml`: change both `cluster` and `cluster_type` to `k8s`, update `launcher_scripts_path` to point to the NeMo-Megatron-Launcher repository hosted on the NFS, and update the stages as necessary.
3. Run `python3 main.py` to launch the job. This creates a Helm chart named `k8s_template` in the `results` directory for the job and launches it automatically.
4. Verify the job was launched with `helm list` and `kubectl get pods`.
5. View job logs by reading the logs from the first pod with `kubectl logs --follow <first pod name>`.
6. Once the job finishes, clean it up by running `helm uninstall <job name>`.
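Putting the steps above together, an end-to-end session on the controller might look like the following (the NFS path is an illustrative example, and the placeholders must be filled in for your job):

```shell
# Launch a job from the launcher repository on the shared NFS mount
# (the path below is an example, not a required location).
cd /mnt/nfs/NeMo-Megatron-Launcher/launcher_scripts
python3 main.py

# Confirm the Helm release and its pods were created.
helm list
kubectl get pods

# Follow the logs of the first pod.
kubectl logs --follow <first pod name>

# Clean up once the job has finished.
helm uninstall <job name>
```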
Container Secrets
For the k8s services to authenticate with private container registries that may host the NeMo FW training and inference containers, a secret needs to be generated with a token for the registry. For example, to create a secret for NGC, first get a token from ngc.nvidia.com, then run the following on the controller to create a secret named `ngc-registry`:
```shell
kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>
```
Update the `pull_secret` value in `conf/cluster/k8s.yaml` with the name of the secret so the cluster can authenticate with the container registry.
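For example, the relevant field in `conf/cluster/k8s.yaml` might look like this (the rest of the file is omitted):

```yaml
# conf/cluster/k8s.yaml (excerpt; other fields omitted)
pull_secret: ngc-registry  # name of the secret created with kubectl above
```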