Deploy Kubernetes

With all required public cloud instances deployed and configured for general use, the environment is ready for K8s deployment. In a hybrid environment, the same tool that deployed K8s on-premises is used to deploy K8s in the public cloud.

  1. Run the cm-kubernetes-setup CLI wizard as the root user on the head node.

    cm-kubernetes-setup
    
  2. Choose Deploy to begin the deployment and then select Ok.

    _images/deploy-kubernetes-01.png
  3. Choose Kubernetes v1.21 and then select Ok.

    _images/deploy-kubernetes-02.png

    K8s version 1.21 was selected to match the version deployed in the on-premises DGX BasePOD deployment.

  4. Choose Containerd (it should be selected by default) and then select Ok.

    _images/deploy-kubernetes-03.png
  5. Optionally, provide a registry mirror and then select Ok.

    This example deployment did not require one.

    _images/deploy-kubernetes-04.png
  6. Configure the basic values of the K8s cluster and select Ok.

    _images/deploy-kubernetes-05.png

    Choose names that make it easy to understand that the K8s deployment is using public cloud resources. In addition, ensure that the service and pod network subnets do not overlap with existing subnets in the cluster.
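
    To avoid an overlap, the subnets already defined in the cluster can be listed from the head node with cmsh before choosing values. This is an optional sanity check, not part of the wizard:

    cmsh -c "network; list"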

  7. Choose yes to expose the K8s API server to the external network and then select Ok.

    _images/deploy-kubernetes-06.png

    This allows users to access the K8s cluster from the head node.

  8. Choose vpc-us-west-2-private for the public cloud-based K8s environment and then select Ok.

    _images/deploy-kubernetes-07.png

    This keeps internal K8s traffic entirely in the public cloud.

  9. Choose the three k8s-cloud-master nodes and then select Ok.

    _images/deploy-kubernetes-08.png
  10. Choose k8s-cloud-gpu-worker for the worker node category and then select Ok.

    _images/deploy-kubernetes-09.png
  11. Select Ok without configuring any individual K8s nodes.

    _images/deploy-kubernetes-10.png
  12. Choose the three knode systems for Etcd nodes and then select Ok.

    _images/deploy-kubernetes-11.png
  13. Configure the K8s main components and then select Ok.

    _images/deploy-kubernetes-12.png

    Use the default ports and path here unless the environment requires different values. The default values were used in this deployment.

  14. Choose the Calico network plugin and then select Ok.

    _images/deploy-kubernetes-13.png
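
    Once the wizard finishes, the Calico pods can be verified from the head node. The kube-system namespace and the k8s-app=calico-node label used below are upstream Calico defaults and are assumptions here:

    kubectl get pods -n kube-system -l k8s-app=calico-node
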
  15. Choose yes to install the Kyverno policy engine and then select Ok.

    _images/deploy-kubernetes-14.png
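
    After the deployment completes, the Kyverno installation can be confirmed from the head node; the kyverno namespace used below is the upstream default and is an assumption here:

    kubectl get pods -n kyverno
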
  16. Choose no to skip configuring HA for Kyverno and then select Ok.

    This deployment does not meet the minimum node requirement for Kyverno HA.

    _images/deploy-kubernetes-15.png
  17. Choose whether to install Kyverno Policies and then select Ok.

    Unless required for the configuration, choose no.

    _images/deploy-kubernetes-16.png
  18. Choose the operator packages to install and then select Ok.

    _images/deploy-kubernetes-17.png

    As shown in the screenshot, choose NVIDIA GPU Operator, Prometheus Adapter, Prometheus Adapter Stack, and the cm-jupyter-kernel-operator.

  19. Choose the same four operators to be rolled out with the defaults and then select Ok.

    _images/deploy-kubernetes-18.png
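
    Once the operators have been rolled out, their pods can be checked from the head node. The gpu-operator namespace below is the upstream default for the NVIDIA GPU Operator and is an assumption; the wizard may use a different namespace:

    kubectl get pods -n gpu-operator
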
  20. Choose the addons to deploy and then select Ok.

    _images/deploy-kubernetes-19.png

    As shown in the screenshot, choose Ingress Controller (Nginx), Kubernetes Dashboard, Kubernetes Metrics Server, and Kubernetes State Metrics.
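
    After deployment, a quick way to confirm these addons are running, without assuming their namespaces, is to filter pods across all namespaces:

    kubectl get pods -A | grep -Ei 'ingress|dashboard|metrics'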

  21. Choose the Ingress ports for the cluster and then select Ok.

    Use the defaults unless specific ingress ports are required.

    _images/deploy-kubernetes-20.png
  22. Choose no when asked to install the Bright NVIDIA packages and then select Ok.

    _images/deploy-kubernetes-21.png

    Because the K8s control plane nodes do not have GPUs, these packages are not needed on them; the GPU Operator manages the NVIDIA software components on the GPU worker nodes.

  23. Choose yes to deploy the Permission Manager and then select Ok.

    _images/deploy-kubernetes-22.png
  24. Select Ok without configuring any optional values.

    _images/deploy-kubernetes-23.png
  25. Choose both enabled and default for the Local path storage class and then select Ok.

    _images/deploy-kubernetes-24.png
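
    After deployment, the storage class can be confirmed from the head node; local-path is the upstream default name of the local path provisioner's storage class and is an assumption here:

    kubectl get storageclass
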
  26. Select Ok without changing any of the default values.

    _images/deploy-kubernetes-25.png
  27. Choose Save config & deploy and then select Ok.

    _images/deploy-kubernetes-26.png
  28. Change the filepath to /root/cm-kubernetes-setup-cloud.conf and then select Ok.

    _images/deploy-kubernetes-27.png

    The filepath was changed to avoid name conflicts with the existing K8s configuration file from the initial on-premises deployment. Wait for the installation to finish.
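
    If the same deployment ever needs to be repeated non-interactively, the saved file can be passed back to the wizard; the -c option shown below for loading a saved configuration is an assumption about this version of the tool:

    cm-kubernetes-setup -c /root/cm-kubernetes-setup-cloud.conf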

  29. Verify the K8s cluster is installed properly.

    If the K8s module for the on-premises deployment is already loaded, unload it first, or use the module switch command as a shortcut to unload the on-premises module and load the public cloud one (see the example after the output below).

    module load kubernetes/aws-cloud/
    kubectl cluster-info
    Kubernetes control plane is running at https://localhost:10443
    CoreDNS is running at https://localhost:10443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

    To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
    
    kubectl get nodes
    NAME                    STATUS   ROLES                  AGE     VERSION
    us-west-2-gpu-node001   Ready    worker                 6m48s   v1.21.4
    us-west-2-knode001      Ready    control-plane,master   6m48s   v1.21.4
    us-west-2-knode002      Ready    control-plane,master   6m48s   v1.21.4
    us-west-2-knode003      Ready    control-plane,master   6m48s   v1.21.4
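
    As a shortcut, the on-premises module can be swapped for the public cloud module in one step; the on-premises module name kubernetes/default below is an assumption and may differ in this deployment:

    module switch kubernetes/default kubernetes/aws-cloud/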
    
  30. Verify that a GPU job can be run on the K8s cluster.

    1. Save the following text to a file named gpu.yaml.

       apiVersion: v1
       kind: Pod
       metadata:
         name: gpu-pod-pytorch
       spec:
         restartPolicy: Never
         containers:
           - name: pytorch-container
             image: nvcr.io/nvidia/pytorch:22.08-py3
             command:
               - nvidia-smi
             resources:
               limits:
                 nvidia.com/gpu: 1
      
    2. Create the pod using kubectl apply.

      kubectl apply -f gpu.yaml
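
      Before checking the logs, confirm that the pod has finished; with restartPolicy: Never and a single nvidia-smi command, the pod status should reach Completed:

      kubectl get pod gpu-pod-pytorch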
      
    3. Use kubectl logs to check the result.

      The output should be similar to the following.

       kubectl logs gpu-pod-pytorch
       Tue Feb 14 22:25:53 2023
       +-----------------------------------------------------------------------------+
       | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
       |-------------------------------+----------------------+----------------------+
       | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
       | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
       |                               |                      |               MIG M. |
       |===============================+======================+======================|
       |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
       | N/A   28C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
       |                               |                      |                  N/A |
       +-------------------------------+----------------------+----------------------+

       +-----------------------------------------------------------------------------+
       | Processes:                                                                  |
       |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
       |        ID   ID                                                   Usage      |
       |=============================================================================|
       |  No running processes found                                                 |
       +-----------------------------------------------------------------------------+
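
      Once the output has been verified, the test pod can be removed:

      kubectl delete -f gpu.yaml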