Kubernetes

  1. Update cm-setup to the latest version on the head node.

    # yum update cm-setup.x86_64
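
    After the update, you can confirm which version is now installed with a standard RPM query (not specific to this wizard):

    # rpm -q cm-setup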
    
  2. Run the cm-kubernetes-setup CLI wizard as the root user on the head node.

    # cm-kubernetes-setup
    
  3. Choose Deploy to start the Kubernetes installation.

  4. Select K8s version v1.27 in the dialog that appears next, and then select OK to continue.

  5. Select OK to confirm the Containerd container runtime.

  6. Fill in the optional Docker Hub registry mirror endpoint if necessary; otherwise, select OK to continue.

  7. Accept the default settings for this K8s cluster, except change the Kubernetes cluster name to onprem to match the naming chosen for this guide.

  8. Select yes to expose the K8s API server on the head node.

    This allows users to access the K8s cluster from the head node.

  9. Select internalnet, since the K8s control plane nodes and the DGX nodes (the K8s worker nodes) are all connected to internalnet.

  10. Select all three K8s control plane nodes: knode01, knode02, and knode03.

  11. Select dgx-a100 for the worker node category.

  12. Do not select any individual Kubernetes nodes, and select OK to continue.

  13. Select all three K8s control plane nodes (knode01, knode02, and knode03) as the Etcd nodes.

  14. Accept the default values for the main Kubernetes components unless the organization requires specific ports.

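    If specific ports are required, it helps to first confirm that a candidate port is free on the head node. A minimal check, using 6443 (the upstream Kubernetes API server default) purely as an example; no output means the port is unused:

    # ss -tlnp | grep 6443
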
  15. Select the Calico network plugin when prompted.

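    After the deployment completes, the Calico pods can be checked from the head node. This sketch assumes Calico is deployed into the kube-system namespace with its default k8s-app=calico-node label:

    # kubectl -n kube-system get pods -l k8s-app=calico-node
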
  16. Choose yes to install the Kyverno policy engine and then select OK.

  17. Choose no to skip configuring HA for Kyverno, and then select OK.

  18. Choose whether to install Kyverno Policies and then select OK.

    Unless required for the configuration, choose no.

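    Once the cluster is up, the Kyverno deployment from steps 16-18 can be verified from the head node. The kyverno namespace below is the engine's default and is an assumption; adjust it if your installation differs:

    # kubectl -n kyverno get pods
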
  19. Select no when prompted about an NVAIE License.

  20. Select the following operators to install: NVIDIA GPU Operator, Network Operator, Prometheus Adapter, Prometheus Operator Stack, cm-jupyter-kernel-operator, and cm-kubernetes-mpi-operator.

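    These operators are installed as Helm releases, so after the deployment finishes they can be listed from the head node (assuming the helm client is available once the kubernetes module is loaded):

    # helm list -A
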
  21. Choose the GPU Operator version.

  22. Choose the Network Operator version.

  23. Skip the optional YAML config for the GPU Operator helm chart.

  24. Choose nfd.enabled for the NVIDIA GPU Operator.

  25. Do not include a YAML file for the Network Operator.

  26. Configure the Network Operator by selecting nfd.enabled, sriovNetworkOperator.enabled, deployCR, secondaryNetwork.deploy, secondaryNetwork.cniPlugins.deploy, secondaryNetwork.multus.deploy, and secondaryNetwork.ipamPlugin.deploy.

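    After deployment, the Network Operator pods can be inspected from the head node. The network-operator namespace below is the operator's default and is an assumption; adjust it if your installation differs:

    # kubectl -n network-operator get pods
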
  27. Select the Ingress Controller (Nginx), Kubernetes Dashboard, Kubernetes Metrics Server, and Kubernetes State Metrics to deploy.

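    Once the Kubernetes Metrics Server selected here is running, it can be exercised with a standard kubectl command (node metrics typically appear a minute or two after the pods start):

    # kubectl top nodes
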
  28. Select the defaults unless specific ingress ports are to be used.

  29. Select yes to deploy the Permission Manager.

  30. Select both enabled and default for the Local path storage class.

  31. Accept the default data storage path and leave the other two fields blank (the default).

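    After installation, the storage class configured in steps 30-31 can be confirmed from the head node; the class marked (default) should appear in the listing:

    # kubectl get storageclass
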
  32. Select Save config & deploy.

  33. Change the file path to /root/cm-kubernetes-setup-onprem.conf and select OK.

    This file can be used to redeploy K8s, or it can be copied and modified to deploy additional K8s clusters. Wait for the installation to finish.

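    As a sketch of how the saved file can be reused, the wizard can typically be rerun non-interactively against it; the -c option is an assumption here, so check cm-kubernetes-setup -h for the exact flag:

    # cm-kubernetes-setup -c /root/cm-kubernetes-setup-onprem.conf
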
  34. Verify that the K8s cluster is installed properly.

    # module load kubernetes/default/
    # kubectl cluster-info
    Kubernetes control plane is running at https://127.0.0.1:10443
    CoreDNS is running at https://127.0.0.1:10443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

    To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

    # kubectl get nodes
    NAME      STATUS   ROLES                  AGE     VERSION
    dgx01     Ready    worker                 7m36s   v1.27.11
    dgx02     Ready    worker                 7m37s   v1.27.11
    dgx03     Ready    worker                 7m36s   v1.27.11
    dgx04     Ready    worker                 7m37s   v1.27.11
    knode01   Ready    control-plane,master   7m59s   v1.27.11
    knode02   Ready    control-plane,master   7m26s   v1.27.11
    knode03   Ready    control-plane,master   7m25s   v1.27.11
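
    A broader health check is to confirm that pods in all namespaces reach the Running or Completed state:

    # kubectl get pods -A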
    
  35. You can also verify that the NVIDIA software has been installed successfully on the DGX nodes.

    # ssh dgx01
    # nvidia-smi
    # nvsm show health
    # dcgmi discovery -l
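
    From the head node, the GPU Operator pods can also be checked. The gpu-operator namespace is an assumption based on the operator's default installation; adjust it if your deployment differs:

    # kubectl -n gpu-operator get pods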