Kubernetes

NeMo Framework currently supports data preparation, base model training, conversion, and evaluation of GPT models on Kubernetes (k8s) clusters, with additional models and stages to be added in the future.

A k8s cluster needs at least two worker nodes, such as DGX A100 or DGX H100 systems, and one controller node with helm and kubectl installed and the ability to launch Helm charts that run on the worker nodes. The implementation has been validated on a vanilla k8s environment, but it can be extended to support other k8s environments, such as managed k8s offerings from various CSPs.
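
As a quick sanity check, kubectl on the controller should report the controller and at least two GPU worker nodes in the Ready state (the node names and versions below are illustrative):

kubectl get nodes
# NAME           STATUS   ROLES           AGE   VERSION
# controller-0   Ready    control-plane   30d   v1.27.4
# dgx-worker-0   Ready    <none>          30d   v1.27.4
# dgx-worker-1   Ready    <none>          30d   v1.27.4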

Additionally, the cluster needs shared NFS storage mounted at the same location on all nodes, including both the workers and the controller(s). The NeMo-Megatron-Launcher repository should be copied to this NFS location so that it is accessible inside pods.
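
For example, assuming the NFS share is mounted at /mnt/nfs on every node (the mount point is site-specific), the repository can be placed on it with:

git clone https://github.com/NVIDIA/NeMo-Megatron-Launcher.git /mnt/nfs/NeMo-Megatron-Launcher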

Lastly, the worker nodes should have at least one high-speed compute fabric, such as InfiniBand, to allow the worker nodes to communicate with one another.
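
One way to confirm the fabric is visible on a worker node, assuming the infiniband-diags package is installed, is:

ibstat
# prints each InfiniBand port along with its state; healthy ports report State: Active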

Operators

A few k8s operators need to be installed for the k8s stages to function. The following list shows the operator versions that were validated by NVIDIA; these exact versions are not strictly required.

  1. GPU Operator: 23.3.2

  2. Network Operator: 23.1.0

  3. KubeFlow Training Operator: 1.6.0

All operators above can be installed with helm.
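
As an illustration, the two NVIDIA operators can be installed from the NVIDIA Helm repository as follows; the chart version strings are assumptions and should be matched to the versions listed above:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --version v23.3.2
helm install --wait network-operator nvidia/network-operator -n network-operator --create-namespace --version 23.1.0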

Typical Flow

Launching jobs on k8s clusters typically follows the same pattern (a consolidated command sketch appears after this list):

  1. Update the cluster config file at conf/cluster/k8s.yaml if necessary.

  2. Update the main config file at conf/config.yaml and change both cluster and cluster_type to k8s. Update the launcher_scripts_path to point to the NeMo-Megatron-Launcher repository hosted on the NFS. Also update the stages as necessary.

  3. Run python3 main.py to launch the job. This creates a Helm chart named k8s_template in the job's results directory and launches it automatically.

  4. Verify the job was launched with helm list and kubectl get pods.

  5. View job logs by reading the logs from the first pod with kubectl logs --follow <first pod name>.

  6. Once the job finishes, clean up the job by running helm uninstall <job name>.
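
Put together, steps 3 through 6 look like the following on the controller; the launcher path and job name are placeholders:

cd /mnt/nfs/NeMo-Megatron-Launcher/launcher_scripts
python3 main.py                           # renders the k8s_template Helm chart and launches it
helm list                                 # verify the new release appears
kubectl get pods                          # one pod per worker node participating in the job
kubectl logs --follow <first pod name>    # stream the job logs
helm uninstall <job name>                 # clean up once the job finishes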

Container Secrets

In order for the k8s cluster to authenticate with private container registries that may host the NeMo FW training and inference containers, a secret needs to be created with a token for the registry. For example, to create a secret for NGC, first get a token from ngc.nvidia.com, then run the following on the controller to create a secret named ngc-registry:

kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=<NGC KEY HERE>
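
The secret can then be inspected to confirm it was created (illustrative output):

kubectl get secret ngc-registry
# NAME           TYPE                             DATA   AGE
# ngc-registry   kubernetes.io/dockerconfigjson   1      5s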

Update the pull_secret value in conf/cluster/k8s.yaml with the name of this secret so the cluster can authenticate with the container registry.
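
For example, with the secret created above, the relevant line in conf/cluster/k8s.yaml would be (other fields omitted):

pull_secret: ngc-registry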
