Deploy Kubernetes

With all required public cloud instances deployed and configured for general use, the environment is ready for K8s deployment. In a hybrid environment, the same tool that deployed K8s on-premises is used to deploy K8s in the public cloud.

  1. Run the cm-kubernetes-setup CLI wizard as the root user on the head node.

    cm-kubernetes-setup
    
  2. Choose Deploy to begin the deployment and then select Ok.

    _images/deploy-kubernetes-01.png
  3. Choose Kubernetes v1.21 and then select Ok.

    _images/deploy-kubernetes-02.png

    K8s version 1.21 was selected to match the version deployed in the on-premises DGX BasePOD deployment.

  4. Choose Containerd (it should be selected by default) and then select Ok.

    _images/deploy-kubernetes-03.png
  5. Optionally, provide a registry mirror and then select Ok.

    This example deployment did not require one.

    _images/deploy-kubernetes-04.png
  6. Configure the basic values of the K8s cluster and select Ok.

    _images/deploy-kubernetes-05.png

    Choose names that make it easy to understand that the K8s deployment is using public cloud resources. In addition, ensure that the service and pod network subnets do not overlap with existing subnets in the cluster.
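
    To avoid an overlap, the subnets already defined in the cluster can be listed from the head node with cmsh before choosing values. This is an optional sanity check, not part of the wizard:

    cmsh -c "network; list"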

  7. Choose yes to expose the K8s API server to the external network and then select Ok.

    _images/deploy-kubernetes-06.png

    This allows users to access the K8s cluster from the head node.

  8. Choose vpc-us-west-2-private for the public cloud-based K8s environment and then select Ok.

    _images/deploy-kubernetes-07.png

    This keeps internal K8s traffic entirely in the public cloud.

  9. Choose the three k8s-cloud-master nodes and then select Ok.

    _images/deploy-kubernetes-08.png
  10. Choose k8s-cloud-gpu-worker for the worker node category and then select Ok.

    _images/deploy-kubernetes-09.png
  11. Select Ok without configuring any individual K8s nodes.

    _images/deploy-kubernetes-10.png
  12. Choose the three knode systems for Etcd nodes and then select Ok.

    _images/deploy-kubernetes-11.png
  13. Configure the K8s main components and then select Ok.

    _images/deploy-kubernetes-12.png

    Use the default ports and path here unless the environment requires different values. The default values were used in this deployment.

  14. Choose the Calico network plugin and then select Ok.

    _images/deploy-kubernetes-13.png
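
    Once the wizard finishes, the Calico pods can be verified from the head node. The kube-system namespace and the k8s-app=calico-node label used below are upstream Calico defaults and are assumptions here:

    kubectl get pods -n kube-system -l k8s-app=calico-node
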
  15. Choose yes to install the Kyverno policy engine and then select Ok.

    _images/deploy-kubernetes-14.png
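
    After the deployment completes, the Kyverno installation can be confirmed from the head node; the kyverno namespace used below is the upstream default and is an assumption here:

    kubectl get pods -n kyverno
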
  16. Choose no to skip configuring HA for Kyverno and then select Ok.

    This deployment does not meet the minimum node requirement for Kyverno HA.

    _images/deploy-kubernetes-15.png
  17. Choose whether to install Kyverno Policies and then select Ok.

    Unless required for the configuration, choose no.

    _images/deploy-kubernetes-16.png
  18. Choose the operator packages to install and then select Ok.

    _images/deploy-kubernetes-17.png

    As shown in the screenshot, choose NVIDIA GPU Operator, Prometheus Adapter, Prometheus Adapter Stack, and the cm-jupyter-kernel-operator.

  19. Choose the same four operators to be rolled out with the defaults and then select Ok.

    _images/deploy-kubernetes-18.png
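
    Once the operators have been rolled out, their pods can be checked from the head node. The gpu-operator namespace below is the upstream default for the NVIDIA GPU Operator and is an assumption; the wizard may use a different namespace:

    kubectl get pods -n gpu-operator
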
  20. Choose the addons to deploy and then select Ok.

    _images/deploy-kubernetes-19.png

    As shown in the screenshot, choose Ingress Controller (Nginx), Kubernetes Dashboard, Kubernetes Metrics Server, and Kubernetes State Metrics.
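
    After deployment, a quick way to confirm these addons are running, without assuming their namespaces, is to filter pods across all namespaces:

    kubectl get pods -A | grep -Ei 'ingress|dashboard|metrics'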

  21. Choose the Ingress ports for the cluster and then select Ok.

    Use the defaults unless specific ingress ports are required.

    _images/deploy-kubernetes-20.png
  22. Choose no when asked to install the Bright NVIDIA packages and then select Ok.

    _images/deploy-kubernetes-21.png

    Because the K8s control plane nodes do not have GPUs, these packages are not needed on them; the GPU Operator manages the NVIDIA software components on the GPU worker nodes.

  23. Choose yes to deploy the Permission Manager and then select Ok.

    _images/deploy-kubernetes-22.png
  24. Select Ok without configuring any optional values.

    _images/deploy-kubernetes-23.png
  25. Choose both enabled and default for the Local path storage class and then select Ok.

    _images/deploy-kubernetes-24.png
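
    After deployment, the storage class can be confirmed from the head node; local-path is the upstream default name of the local path provisioner's storage class and is an assumption here:

    kubectl get storageclass
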
  26. Select Ok without changing any of the default values.

    _images/deploy-kubernetes-25.png
  27. Choose Save config & deploy and then select Ok.

    _images/deploy-kubernetes-26.png
  28. Change the filepath to /root/cm-kubernetes-setup-cloud.conf and then select Ok.

    _images/deploy-kubernetes-27.png

    The filepath was changed to avoid name conflicts with the existing K8s configuration file from the initial on-premises deployment. Wait for the installation to finish.
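
    If the same deployment ever needs to be repeated non-interactively, the saved file can be passed back to the wizard; the -c option shown below for loading a saved configuration is an assumption about this version of the tool:

    cm-kubernetes-setup -c /root/cm-kubernetes-setup-cloud.conf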

  29. Verify the K8s cluster is installed properly.

    If the K8s module for the on-premises deployment is already loaded, unload it first, or use the module switch command as a shortcut to unload the on-premises module and load the public cloud one (see the example after the output below).

    module load kubernetes/aws-cloud/
    kubectl cluster-info
    Kubernetes control plane is running at https://localhost:10443
    CoreDNS is running at https://localhost:10443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

    To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
    
    kubectl get nodes
    NAME                    STATUS   ROLES                  AGE     VERSION
    us-west-2-gpu-node001   Ready    worker                 6m48s   v1.21.4
    us-west-2-knode001      Ready    control-plane,master   6m48s   v1.21.4
    us-west-2-knode002      Ready    control-plane,master   6m48s   v1.21.4
    us-west-2-knode003      Ready    control-plane,master   6m48s   v1.21.4
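
    As a shortcut, the on-premises module can be swapped for the public cloud module in one step; the on-premises module name kubernetes/default below is an assumption and may differ in this deployment:

    module switch kubernetes/default kubernetes/aws-cloud/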
    
  30. Verify that a GPU job can be run on the K8s cluster.

    1. Save the following text to a file named gpu.yaml.

       apiVersion: v1
       kind: Pod
       metadata:
         name: gpu-pod-pytorch
       spec:
         restartPolicy: Never
         containers:
           - name: pytorch-container
             image: nvcr.io/nvidia/pytorch:22.08-py3
             command:
               - nvidia-smi
             resources:
               limits:
                 nvidia.com/gpu: 1
      
    2. Create the pod using kubectl apply.

      kubectl apply -f gpu.yaml
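
      Before checking the logs, confirm that the pod has finished; with restartPolicy: Never and a single nvidia-smi command, the pod status should reach Completed:

      kubectl get pod gpu-pod-pytorch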
      
    3. Use kubectl logs to check the result.

      The output should be similar to the following.

       kubectl logs gpu-pod-pytorch
       Tue Feb 14 22:25:53 2023
       +-----------------------------------------------------------------------------+
       | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
       |-------------------------------+----------------------+----------------------+
       | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
       | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
       |                               |                      |               MIG M. |
       |===============================+======================+======================|
       |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
       | N/A   28C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
       |                               |                      |                  N/A |
       +-------------------------------+----------------------+----------------------+

       +-----------------------------------------------------------------------------+
       | Processes:                                                                  |
       |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
       |        ID   ID                                                   Usage      |
       |=============================================================================|
       |  No running processes found                                                 |
       +-----------------------------------------------------------------------------+
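
      Once the output has been verified, the test pod can be removed:

      kubectl delete -f gpu.yaml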