Deploy Kubernetes#

This documentation is part of NVIDIA DGX BasePOD: Deployment Guide Featuring NVIDIA DGX A100 Systems.

Note

Before you complete the steps in this documentation, complete Deploy Docker.

Warning

The # prompt indicates commands that you execute as the root user on a head node. The % prompt indicates commands that you execute within cmsh.

Install and Verify Kubernetes (K8s)#

  1. Run the cm-kubernetes-setup CLI wizard as the root user on the head node.

    # cm-kubernetes-setup
    
  2. Choose Deploy to start the deployment.

  3. Select K8s version v1.27 in the next dialog and select Ok to continue.

  4. Select Ok to confirm the Containerd container runtime.

  5. If necessary, fill in the optional DockerHub registry mirror endpoint; otherwise, select Ok to continue.

  6. Accept the default settings for this K8s cluster, except change the K8s cluster name to onprem to match the naming used in this guide.

  7. Select yes to expose the K8s API server on the head node. This allows users to use the K8s cluster from the head node.

  8. Select internalnet since the K8s control plane nodes and the DGX nodes, which are the K8s worker nodes, are all connected on internalnet.

  9. Select all three K8s control plane nodes: knode01, knode02, and knode03.

  10. Select dgx-a100 for the worker node category.

  11. Do not select any individual Kubernetes nodes, and select Ok to continue.

  12. Select all three K8s control plane nodes: knode01, knode02, and knode03 for the Etcd nodes.

  13. Accept the default values for the main Kubernetes components unless the organization requires specific ports.

  14. Select the Calico network plugin when prompted.

  15. Choose yes to install the Kyverno policy engine and then select Ok.

  16. Choose no to skip configuring HA for Kyverno and then select Ok.

  17. Choose whether to install Kyverno Policies and then select Ok. Unless required for the configuration, choose no.

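If Kyverno Policies are enabled later, they are ordinary Kubernetes resources managed with kubectl. The following is a hypothetical sketch only, not part of this guide: a minimal audit-mode ClusterPolicy that flags Pods missing an owner label (the policy and label names are illustrative).

```yaml
# Hypothetical example: audit Pods that lack an "owner" label.
# Names are placeholders; adapt to organizational policy requirements.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-owner-label
spec:
  validationFailureAction: Audit   # report violations without blocking admission
  rules:
    - name: check-owner-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods should carry an 'owner' label."
        pattern:
          metadata:
            labels:
              owner: "?*"          # any non-empty value
```

Switching validationFailureAction to Enforce would reject non-conforming Pods instead of only reporting them.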
  18. Select the following operators to install: NVIDIA GPU Operator, Network Operator, Prometheus Adapter, Prometheus Operator Stack, cm-jupyter-kernel-operator, and cm-kubernetes-mpi-operator.

  19. Skip the optional YAML config for the Network Operator helm chart.

  20. Configure the Network Operator by selecting nfd.enabled, sriovNetworkOperator.enabled, deployCR, secondaryNetwork.deploy, secondaryNetwork.cniPlugins.deploy, secondaryNetwork.multus.deploy, and secondaryNetwork.ipamPlugin.deploy.

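With the secondary network components selected above, Multus and the whereabouts IPAM plugin are deployed, and workloads attach to a secondary network through a NetworkAttachmentDefinition. The sketch below is hypothetical; the resource name, host interface, and address range are placeholders, not values from this guide.

```yaml
# Hypothetical example of a Multus secondary network attachment.
# "ens1f0" and the address range are placeholders for site-specific values.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: example-secondary-net
  namespace: default
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens1f0",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.100.0/24"
      }
    }
```

A Pod would join this network by adding the annotation k8s.v1.cni.cncf.io/networks: example-secondary-net to its metadata.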
  21. Select the Ingress Controller (Nginx), Kubernetes Dashboard, Kubernetes Metrics Server, and Kubernetes State Metrics to deploy.

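Once the Nginx ingress controller is deployed, services are exposed with standard Ingress resources. A hypothetical example follows; the host, service name, and port are placeholders, not values from this guide.

```yaml
# Hypothetical example: route HTTP traffic for a placeholder host to a
# placeholder backend service via the Nginx ingress controller.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service
                port:
                  number: 80
```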
  22. Accept the defaults unless specific ingress ports are required.

  23. Select no since the K8s control plane nodes do not have GPUs.

  24. Select yes to deploy the Permission Manager.

  25. Select both enabled and default for the Local path storage class.

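With the local path storage class enabled and set as default, workloads can request node-local storage through an ordinary PersistentVolumeClaim. The sketch below is hypothetical; the claim name and size are placeholders, and the class name local-path is assumed to match the wizard's default.

```yaml
# Hypothetical example: claim node-local storage from the default class.
# "example-claim", the size, and the class name are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 10Gi
```

Because the class is marked default, omitting storageClassName entirely would yield the same binding.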
  26. Accept the default data storage path and leave the other two fields blank, which is the default.

  27. Select Save config & deploy.

  28. Change the filepath to /root/cm-kubernetes-setup-onprem.conf and select Ok, then wait for the installation to finish. This file can be reused to redeploy K8s, or copied and modified to deploy additional K8s clusters.

  29. Verify that the K8s cluster is installed properly.

     # module load kubernetes/onprem/1.27.4-00
     # kubectl cluster-info
     Kubernetes control plane is running at https://127.0.0.1:10443
     CoreDNS is running at https://127.0.0.1:10443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

     To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

     # kubectl get nodes
     NAME      STATUS   ROLES                  AGE     VERSION
     dgx01     Ready    worker                 7m36s   v1.27.4
     dgx02     Ready    worker                 7m37s   v1.27.4
     dgx03     Ready    worker                 7m36s   v1.27.4
     dgx04     Ready    worker                 7m37s   v1.27.4
     knode01   Ready    control-plane,master   7m59s   v1.27.4
     knode02   Ready    control-plane,master   7m26s   v1.27.4
     knode03   Ready    control-plane,master   7m25s   v1.27.4
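
The node listing can also be checked in a script, for example to confirm that every node reports Ready before scheduling work. A minimal sketch; the captured sample below stands in for live `kubectl get nodes --no-headers` output.

```shell
# Report any node whose STATUS column is not "Ready".
# "sample" is captured text standing in for the live command; on the
# cluster, replace it with: kubectl get nodes --no-headers
sample='dgx01     Ready    worker                 7m36s   v1.27.4
knode01   Ready    control-plane,master   7m59s   v1.27.4
knode02   NotReady control-plane,master   7m26s   v1.27.4'
not_ready=$(printf '%s\n' "$sample" | awk '$2 != "Ready" { print $1 }')
if [ -n "$not_ready" ]; then
  echo "Nodes not Ready: $not_ready"
else
  echo "All nodes Ready"
fi
```

With the sample above, the script prints `Nodes not Ready: knode02`; on a healthy cluster the list is empty and it prints `All nodes Ready`.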
    

Next Steps#

After you complete the steps on this page, see Deploy Slurm.