NeMo Framework currently supports data preparation, base model training, conversion, and evaluation of GPT models on Kubernetes (k8s) clusters with additional models and stages to be included in the future.
At a minimum, k8s clusters need at least two worker nodes, such as DGX A100 or DGX H100 systems, and one controller node with kubectl installed and the ability to launch Helm charts that run on the worker nodes. The implementation has been validated on a vanilla k8s environment, but it can be extended to support other k8s environments, such as managed k8s offerings from various CSPs.
Additionally, the cluster needs shared NFS storage mounted at the same location on all nodes, including both the workers and the controller(s). The NeMo-Megatron-Launcher repository should be copied to this NFS location so that it is accessible from within pods.
Lastly, the worker nodes should have at least one high-speed compute fabric, such as InfiniBand, to allow the worker nodes to communicate with one another.
A few k8s operators need to be installed for the k8s stages to function. The following list includes the operator versions that were installed and validated by NVIDIA, though these exact versions are not necessarily required.
All of the operators above can be installed with Helm.
Launching jobs on k8s clusters typically follows the same pattern:

1. Update the cluster config file at conf/cluster/k8s.yaml.
2. Update the main config file at conf/config.yaml and change both cluster and cluster_type to k8s. Update launcher_scripts_path to point to the NeMo-Megatron-Launcher repository hosted on the NFS. Also update the stages as necessary.
3. Run python3 main.py to launch the job. This creates a Helm chart for the job, named k8s_template, in the results directory, and the chart is launched automatically.
4. Verify the job was launched with kubectl get pods.
5. View the job logs by reading the logs from the first pod with kubectl logs --follow <first pod name>.
6. Once the job finishes, clean it up by running helm uninstall <job name>.
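The main config changes described above can be sketched as follows. This is an illustrative fragment, not a complete config: the NFS path and the stage names are placeholder examples, and the exact set of keys may differ between launcher releases.

```yaml
# conf/config.yaml (fragment) -- values below are examples only
cluster: k8s                # select the Kubernetes cluster definition
cluster_type: k8s           # run stages through the k8s launcher path
launcher_scripts_path: /nfs/shared/NeMo-Megatron-Launcher/launcher_scripts  # example NFS path
stages:                     # enable only the stages you intend to run
  - data_preparation
  - training
```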
In order for the k8s services to authenticate with private container registries that may host the NeMo FW training and inference containers, a secret needs to be created with a token for the registry. For example, to create a secret for NGC, first get a token from ngc.nvidia.com, then run the following on the controller to create a key named ngc-registry:

kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>
Then update the pull_secret value in conf/cluster/k8s.yaml with the name of the secret key so the cluster can authenticate with the container registry.
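For reference, a hedged sketch of the corresponding cluster config file. Only pull_secret comes from the secret created above; the remaining keys and values are assumptions included solely to illustrate the shape of the file, and the actual keys may differ between launcher releases.

```yaml
# conf/cluster/k8s.yaml (fragment) -- only pull_secret is taken from the steps
# above; the other keys and values are illustrative assumptions
pull_secret: ngc-registry   # name of the registry secret created with kubectl
shm_size: 512Gi             # example shared-memory size for training pods
nfs_server: 10.0.0.5        # example NFS server address
nfs_path: /nfs/shared       # example path mounted identically on all nodes
```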