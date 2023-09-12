The K8s cluster in this solution is installed using Kubespray with a non-root user account from the Deployment Node.

Log in to the Deployment Node as the deployment user (in this example, user) and generate an SSH key pair to configure password-less authentication from the Deployment Node to the cluster nodes:

Deployment Node Console

    $ sudo su - user
    $ ssh-keygen
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/user/.ssh/id_rsa):
    Created directory '/home/user/.ssh'.
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/user/.ssh/id_rsa.
    Your public key has been saved in /home/user/.ssh/id_rsa.pub.
    The key fingerprint is:
    SHA256:PaZkvxV4K/h8q32zPWdZhG1VS0DSisAlehXVuiseLgA user@depl-node
    The key's randomart image is:
    +---[RSA 2048]----+
    | ...+oo+o..o|
    | .oo .o. o|
    | . .. . o +.|
    | E . o + . +|
    | . S = + o |
    | . o = + o .|
    | . o.o + o|
    | ..+.*. o+o|
    | oo*ooo.++|
    +----[SHA256]-----+

Run the following commands to copy your SSH public key, such as ~/.ssh/id_rsa.pub, to all nodes in your deployment. The example shows node1 in the deployment.

Deployment Node Console

    $ ssh-copy-id 10.10.1.1
    /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/user/.ssh/id_rsa.pub"
    The authenticity of host '10.10.1.1 (10.10.1.1)' can't be established.
    ECDSA key fingerprint is SHA256:uyglY5g0CgPNGDm+XKuSkFAbx0RLaPijpktANgXRlD8.
    Are you sure you want to continue connecting (yes/no)? yes
    /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
    /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
    user@10.10.1.1's password:

    Number of key(s) added: 1

    Now try logging into the machine, with: "ssh 'user@10.10.1.1'"
    and check to make sure that only the key(s) you wanted were added.

To verify that you have password-less SSH connectivity to all nodes in your deployment, run the following command:

Deployment Node Console

    $ ssh user@10.10.1.1
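The command above checks a single node. The same check can be repeated for every node with a small loop. The following is a minimal sketch, assuming the deployment user is user and the nodes are the five example addresses used in this guide; the function name check_passwordless_ssh is hypothetical. The BatchMode=yes option makes ssh fail instead of prompting for a password, so a prompt cannot hide a missing key.

```shell
#!/usr/bin/env bash
# Sketch: report which nodes accept key-based (password-less) SSH logins.
# Assumption: the function name and the "user" account are illustrative only.

check_passwordless_ssh() {
    local user="$1"; shift
    local failed=0 ip
    for ip in "$@"; do
        # BatchMode=yes disables password prompts; ConnectTimeout bounds the wait.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "${user}@${ip}" true; then
            echo "OK   ${ip}"
        else
            echo "FAIL ${ip}"
            failed=1
        fi
    done
    return "$failed"
}

# Example invocation (not executed here):
#   check_passwordless_ssh user 10.10.1.1 10.10.1.2 10.10.1.3 10.10.1.4 10.10.1.5
```

The function returns a non-zero status if any node reports FAIL, which makes it usable as a gate before starting the Kubespray run.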





To install the dependencies for running Kubespray with Ansible on the Deployment Node, run the following commands:

Deployment Node Console

    $ cd ~
    $ sudo apt -y install python3-pip jq python3.10-venv
    $ git clone https://github.com/kubernetes-sigs/kubespray.git
    $ cd kubespray
    $ python3 -m venv .venv
    $ source .venv/bin/activate
    $ python3 -m pip install --upgrade pip
    $ pip install -U -r requirements.txt
    $ pip install ruamel-yaml

Create a new cluster configuration. The default folder for subsequent commands is ~/kubespray.

Replace the IP addresses below with the IP addresses of your nodes:

Deployment Node Console

    $ cp -rfp inventory/sample inventory/mycluster
    $ declare -a IPS=(10.10.1.1 10.10.1.2 10.10.1.3 10.10.1.4 10.10.1.5)
    $ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

The inventory/mycluster/hosts.yaml file is created.

Review and change the host configuration in the file. The following is an example for this deployment:

inventory/mycluster/hosts.yaml

    $ vi inventory/mycluster/hosts.yaml
    all:
      hosts:
        node1:
          ansible_host: 10.10.1.1
          ip: 10.10.1.1
          access_ip: 10.10.1.1
        node2:
          ansible_host: 10.10.1.2
          ip: 10.10.1.2
          access_ip: 10.10.1.2
        node3:
          ansible_host: 10.10.1.3
          ip: 10.10.1.3
          access_ip: 10.10.1.3
        node4:
          ansible_host: 10.10.1.4
          ip: 10.10.1.4
          access_ip: 10.10.1.4
        node5:
          ansible_host: 10.10.1.5
          ip: 10.10.1.5
          access_ip: 10.10.1.5
      children:
        kube_control_plane:
          hosts:
            node1:
        kube_node:
          hosts:
            node2:
            node3:
            node4:
            node5:
        etcd:
          hosts:
            node1:
        k8s_cluster:
          children:
            kube_control_plane:
            kube_node:
        calico_rr:
          hosts: {}

Note In the example deployment, there is 1 master node (node1) and 4 worker nodes (node2-5), so configure hosts.yaml as follows:

    kube_control_plane: node1
    kube_node: node2-5
    etcd: node1

Review and change the cluster installation parameters in the inventory/mycluster/group_vars/all/all.yml and inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml files.

In the inventory/mycluster/group_vars/all/all.yml file, remove the comment from the following line to enable Kubelet to serve on a read-only API (for metrics exposure) with no authentication or authorization:

Deployment Node Console

    $ sed -i 's/#\ kube_read_only_port:/kube_read_only_port:/g' inventory/mycluster/group_vars/all/all.yml

In the inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml file, set kube_version to v1.29.0, set container_manager to containerd, and make sure multi-networking is disabled (kube_network_plugin_multus: false). Multus is installed later as part of the NVIDIA Network Operator:

Deployment Node Console

    $ vi inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
    …
    ## Change this to use another Kubernetes version, e.g. a current beta release
    kube_version: v1.29.0
    …
    ## Container runtime
    ## docker for docker, crio for cri-o and containerd for containerd.
    ## Default: containerd
    container_manager: containerd
    …
    # Setting multi_networking to true will install Multus: https:
    kube_network_plugin_multus: false
    …

In the inventory/mycluster/group_vars/all/etcd.yml file, set the etcd_deployment_type to host:

Deployment Node Console

    $ vi inventory/mycluster/group_vars/all/etcd.yml
    ...
    ## Settings for etcd deployment type
    # Set this to docker if you are using container_manager: docker
    etcd_deployment_type: host





To start the deployment process, run the following command:

Deployment Node Console

    $ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

It takes a while for this deployment to complete. Make sure there are no errors.
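One way to check for errors is to scan the PLAY RECAP summary that ansible-playbook prints at the end of a run. The following is a minimal sketch, assuming the playbook output was saved to a file (for example by appending "| tee kubespray.log" to the command, where the file name is hypothetical); the function name recap_failures is also hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list hosts whose PLAY RECAP line reports failed or unreachable tasks.
# Assumption: the ansible-playbook output was saved to a log file beforehand.

recap_failures() {
    # Reads the log on stdin, prints offending recap lines,
    # and exits with status 1 if at least one was found.
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && /(unreachable|failed)=[1-9]/ { print; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Example invocation (not executed here):
#   recap_failures < kubespray.log && echo "no failed or unreachable hosts"
```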

A successful result looks similar to the following:

Note Now that the K8s cluster is deployed, connect to the K8s Master Node for the following sections and use the root account (where the K8s cluster credentials are stored).





Below is an output example of a K8s cluster with the deployment information and with default Kubespray configuration using the Calico K8s CNI plugin.

To ensure that the K8s cluster is installed correctly, run the following commands:

Master Node Console

    root@node1:~# kubectl get nodes -o wide
    NAME    STATUS   ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
    node1   Ready    control-plane   2m8s   v1.29.0   10.10.1.1     <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.16
    node2   Ready    <none>          93s    v1.29.0   10.10.1.2     <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.16
    node3   Ready    <none>          92s    v1.29.0   10.10.1.3     <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.16
    node4   Ready    <none>          93s    v1.29.0   10.10.1.4     <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.16
    node5   Ready    <none>          93s    v1.29.0   10.10.1.5     <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.16

    root@node1:~# kubectl get pods -n kube-system -o wide
    NAME                                       READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
    calico-kube-controllers-68485cbf9c-6sf4h   1/1     Running   0          62s     10.233.102.143   node1   <none>           <none>
    calico-node-fxpxl                          1/1     Running   0          79s     10.10.1.2        node2   <none>           <none>
    calico-node-k6qzp                          1/1     Running   0          79s     10.10.1.5        node5   <none>           <none>
    calico-node-mh4pp                          1/1     Running   0          79s     10.10.1.4        node4   <none>           <none>
    calico-node-mslh4                          1/1     Running   0          79s     10.10.1.3        node3   <none>           <none>
    calico-node-ngnxx                          1/1     Running   0          79s     10.10.1.1        node1   <none>           <none>
    coredns-69db55dd76-qq5mw                   1/1     Running   0          51s     10.233.75.23     node2   <none>           <none>
    coredns-69db55dd76-qrl6q                   1/1     Running   0          54s     10.233.102.129   node1   <none>           <none>
    dns-autoscaler-6f4b597d8c-5cmgz            1/1     Running   0          52s     10.233.102.130   node1   <none>           <none>
    kube-apiserver-node1                       1/1     Running   1          2m15s   10.10.1.1        node1   <none>           <none>
    kube-controller-manager-node1              1/1     Running   2          2m15s   10.10.1.1        node1   <none>           <none>
    kube-proxy-2hfcg                           1/1     Running   0          98s     10.10.1.3        node3   <none>           <none>
    kube-proxy-444mg                           1/1     Running   0          98s     10.10.1.2        node2   <none>           <none>
    kube-proxy-52ctj                           1/1     Running   0          98s     10.10.1.4        node4   <none>           <none>
    kube-proxy-7g9xv                           1/1     Running   0          98s     10.10.1.1        node1   <none>           <none>
    kube-proxy-zg6t2                           1/1     Running   0          98s     10.10.1.5        node5   <none>           <none>
    kube-scheduler-node1                       1/1     Running   1          2m14s   10.10.1.1        node1   <none>           <none>
    nginx-proxy-node2                          1/1     Running   0          101s    10.10.1.2        node2   <none>           <none>
    nginx-proxy-node3                          1/1     Running   0          101s    10.10.1.3        node3   <none>           <none>
    nginx-proxy-node4                          1/1     Running   0          102s    10.10.1.4        node4   <none>           <none>
    nginx-proxy-node5                          1/1     Running   0          102s    10.10.1.5        node5   <none>           <none>
    nodelocaldns-7tnjx                         1/1     Running   0          52s     10.10.1.2        node2   <none>           <none>
    nodelocaldns-qkm5t                         1/1     Running   0          52s     10.10.1.4        node4   <none>           <none>
    nodelocaldns-rhd9g                         1/1     Running   0          52s     10.10.1.5        node5   <none>           <none>
    nodelocaldns-tg5pm                         1/1     Running   0          52s     10.10.1.3        node3   <none>           <none>
    nodelocaldns-wlwkn                         1/1     Running   0          52s     10.10.1.1        node1   <none>           <none>
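Instead of reading the node list by eye, the Ready count can be compared against the expected cluster size. The following is a minimal sketch, assuming the five-node example cluster of this guide; the function name ready_node_count is hypothetical. It reads the output of kubectl get nodes --no-headers on standard input, where the second column is STATUS.

```shell
#!/usr/bin/env bash
# Sketch: count the nodes whose STATUS column is exactly "Ready".
# Assumption: input is the output of "kubectl get nodes --no-headers".

ready_node_count() {
    # n + 0 forces a numeric 0 when no line matched.
    awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Example invocation (not executed here):
#   [ "$(kubectl get nodes --no-headers | ready_node_count)" -eq 5 ] && echo "all 5 nodes Ready"
```

A node reported as NotReady or Ready,SchedulingDisabled is not counted, so a mismatch with the expected number points at a node that needs attention.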





The NVIDIA Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components and enable fast networking and RDMA for workloads in a K8s cluster. The Fast Network is a secondary network of the K8s cluster for applications that require high bandwidth or low latency.

You need to provision and configure several components. Perform all operator configuration and installation steps from the K8S master node with the root user account.

Install helm on the K8S master node:

Master Node Console

    # wget https://get.helm.sh/helm-v3.15.1-linux-amd64.tar.gz
    # tar -zxvf helm-v3.15.1-linux-amd64.tar.gz
    # mv linux-amd64/helm /usr/local/bin/helm

Label the worker nodes:

Master Node Console

    # for i in $(seq 2 5); do kubectl label nodes node$i node-role.kubernetes.io/worker=; done
    node/node2 labeled
    node/node3 labeled
    node/node4 labeled
    node/node5 labeled
    # kubectl get nodes
    NAME    STATUS   ROLES           AGE   VERSION
    node1   Ready    control-plane   12d   v1.29.0
    node2   Ready    worker          12d   v1.29.0
    node3   Ready    worker          12d   v1.29.0
    node4   Ready    worker          12d   v1.29.0
    node5   Ready    worker          12d   v1.29.0

Note K8s Worker Node labeling is required for a proper installation of the NVIDIA Network Operator.





Add the NVIDIA Network Operator Helm repository:

Master Node Console

    # helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    # helm repo update

Install the operator with custom values; use a configuration file to override some of the default values.

Generate the values.yaml file:

Master Node Console

    # helm show values nvidia/network-operator --version v24.4.0 > values.yaml

Edit the values.yaml file to enable SR-IOV support, the secondary network for K8S pods, and to install the MLNX_OFED driver as part of the operator deployment (necessary for GDR):

values.yaml

    ...
    nfd:
      enabled: true
    ...
    sriovNetworkOperator:
      enabled: true
    ...
    # NicClusterPolicy CR values:
    deployCR: true
    ofedDriver:
      deploy: true
      env:
        - name: UNLOAD_STORAGE_MODULES
          value: "true"
    ...
    rdmaSharedDevicePlugin:
      deploy: false
    ...
    sriovDevicePlugin:
      deploy: false
    ...
    secondaryNetwork:
      deploy: true
      cniPlugins:
        deploy: true
      ...
      multus:
        deploy: true
      ...
      ipamPlugin:
        deploy: true

Deploy the operator:

Master Node Console

    # helm install --wait network-operator nvidia/network-operator -n nvidia-network-operator --create-namespace --version v24.4.0 -f ./values.yaml

After deployment, the SRIOV Network Operator is configured, and SriovNetworkNodePolicy and SriovNetwork are deployed.

To speed up the deployment, you can configure an SriovNetworkPoolConfig before you deploy the operator and set its maxUnavailable parameter to 2 instead of 1, which allows more than one node to be drained at a time:

sriovnetwork-pool-config.yaml

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkPoolConfig
    metadata:
      name: worker
      namespace: nvidia-network-operator
    spec:
      maxUnavailable: 2
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker: ""

Apply the file:

Master Node Console

    # kubectl apply -f sriovnetwork-pool-config.yaml
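As a rough illustration of why a larger maxUnavailable value shortens the rollout: if worker nodes are drained in batches of at most maxUnavailable nodes, the number of sequential batches is the node count divided by maxUnavailable, rounded up. This batch model is a simplifying assumption made here for illustration, and the function name drain_rounds is hypothetical; actual timing depends on how long each node takes to drain and reconfigure.

```shell
#!/usr/bin/env bash
# Sketch: estimate the number of sequential drain batches.
# Assumption: nodes are processed in batches of at most max_unavailable nodes.

drain_rounds() {
    local nodes="$1" max_unavailable="$2"
    # Integer ceiling of nodes / max_unavailable.
    echo $(( (nodes + max_unavailable - 1) / max_unavailable ))
}

drain_rounds 4 1   # prints 4: the four workers are drained one at a time
drain_rounds 4 2   # prints 2: the four workers are drained two at a time
```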

Create the configuration files and apply them.

sriovnetwork-node-policy.yaml configuration file example:

sriovnetwork-node-policy.yaml

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-1
      namespace: nvidia-network-operator
    spec:
      deviceType: netdevice
      mtu: 8950
      nicSelector:
        vendor: "15b3"
        pfNames: ["enp63s0f0np0"]
      nodeSelector:
        feature.node.kubernetes.io/pci-15b3.present: "true"
      numVfs: 8
      priority: 90
      isRdma: true
      resourceName: sriov_rdma

sriovnetwork.yaml configuration file example:

sriovnetwork.yaml

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: "sriov20"
      namespace: nvidia-network-operator
    spec:
      vlan: 20
      spoofChk: "off"
      networkNamespace: "default"
      resourceName: "sriov_rdma"
      capabilities: '{ "mac": true }'
      ipam: |-
        {
          "datastore": "kubernetes",
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
          },
          "log_file": "/tmp/whereabouts.log",
          "log_level": "debug",
          "type": "whereabouts",
          "range": "192.168.20.0/24"
        }
      metaPlugins: |
        {
          "type": "rdma"
        }

Apply the configuration files described above:

Master Node Console

    # kubectl apply -f sriovnetwork-node-policy.yaml
    # kubectl apply -f sriovnetwork.yaml

Wait for all required pods to be spawned:

Master Node Console

    # kubectl get pod -n nvidia-network-operator
    NAME                                                              READY   STATUS    RESTARTS   AGE
    cni-plugins-ds-bqpc5                                              1/1     Running   0          8h
    cni-plugins-ds-c98p7                                              1/1     Running   0          8h
    cni-plugins-ds-jrxss                                              1/1     Running   0          8h
    cni-plugins-ds-z65q4                                              1/1     Running   0          8h
    kube-multus-ds-fdfpq                                              1/1     Running   0          8h
    kube-multus-ds-kq6hr                                              1/1     Running   0          8h
    kube-multus-ds-lw666                                              1/1     Running   0          8h
    kube-multus-ds-nx5tb                                              1/1     Running   0          8h
    mofed-ubuntu22.04-7d7f9f998-ds-47t7q                              1/1     Running   0          8h
    mofed-ubuntu22.04-7d7f9f998-ds-8hsl8                              1/1     Running   0          8h
    mofed-ubuntu22.04-7d7f9f998-ds-rhq7v                              1/1     Running   0          8h
    mofed-ubuntu22.04-7d7f9f998-ds-vmjxr                              1/1     Running   0          8h
    network-operator-5b75d4455d-tdgqm                                 1/1     Running   0          8h
    network-operator-node-feature-discovery-master-568478db7d-k8l55   1/1     Running   0          8h
    network-operator-node-feature-discovery-worker-8r94l              1/1     Running   0          8h
    network-operator-node-feature-discovery-worker-bm6sm              1/1     Running   0          8h
    network-operator-node-feature-discovery-worker-d67xg              1/1     Running   0          8h
    network-operator-node-feature-discovery-worker-pnrn9              1/1     Running   0          8h
    network-operator-node-feature-discovery-worker-rgfrg              1/1     Running   0          8h
    network-operator-sriov-network-operator-6478f68965-tqlbb          1/1     Running   0          8h
    sriov-device-plugin-2nz4d                                         1/1     Running   0          8h
    sriov-device-plugin-8x64x                                         1/1     Running   0          8h
    sriov-device-plugin-vw7mh                                         1/1     Running   0          8h
    sriov-device-plugin-x4fnx                                         1/1     Running   0          8h
    sriov-device-plugin-zxlc8                                         1/1     Running   0          8h
    sriov-network-config-daemon-2w42j                                 1/1     Running   0          8h
    sriov-network-config-daemon-4t7bb                                 1/1     Running   0          8h
    sriov-network-config-daemon-fvl66                                 1/1     Running   0          8h
    sriov-network-config-daemon-gvjgh                                 1/1     Running   0          8h
    sriov-network-config-daemon-srbhs                                 1/1     Running   0          8h
    whereabouts-87wmm                                                 1/1     Running   0          8h
    whereabouts-kkg9q                                                 1/1     Running   0          8h
    whereabouts-qk4v2                                                 1/1     Running   0          8h
    whereabouts-trx2q                                                 1/1     Running   0          8h
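With this many pods, it can help to list only the ones that are not yet ready. The following is a minimal sketch, assuming the input is the output of kubectl get pods --no-headers (columns NAME, READY, STATUS, and so on); the function name pods_not_ready is hypothetical. Pods in the Completed state are treated as fine, and a Running pod counts as ready only when both sides of its READY column match.

```shell
#!/usr/bin/env bash
# Sketch: print pods that are neither Completed nor fully ready and Running.
# Assumption: input is the output of "kubectl get pods --no-headers".

pods_not_ready() {
    awk '
        $3 == "Completed" { next }          # finished one-shot pods are fine
        {
            split($2, r, "/")               # READY has the form ready/total
            if ($3 != "Running" || r[1] != r[2]) {
                print $1, $2, $3
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Example invocation (not executed here):
#   kubectl get pods -n nvidia-network-operator --no-headers | pods_not_ready
```

An empty result with exit status 0 indicates that every listed pod is either Completed or fully ready.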

Verify that a network attachment definition is created for the network and that the allocatable resources on each worker node now include nvidia.com/sriov_rdma, with a count equal to the number of VFs (8 in this example):

Master Node Console

    # kubectl get net-attach-def
    NAME      AGE
    sriov20   13m
    # kubectl describe net-attach-def sriov20
    Name:         sriov20
    Namespace:    default
    Labels:       <none>
    Annotations:  k8s.v1.cni.cncf.io/resourceName: nvidia.com/sriov_rdma
    API Version:  k8s.cni.cncf.io/v1
    Kind:         NetworkAttachmentDefinition
    Metadata:
      Creation Timestamp:  2024-07-07T13:15:08Z
      Generation:          1
      Resource Version:    5071113
      UID:                 3da65cc7-eab6-4cc6-8a0a-0be000c5ea2d
    Spec:
      Config:  { "cniVersion": "0.3.1", "name": "sriov20", "plugins": [ { "type": "sriov", "vlan": 20, "spoofchk": "off", "vlanQoS": 0, "capabilities": { "mac": true }, "logLevel": "info", "ipam": { "datastore": "kubernetes", "kubernetes": { "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig" }, "log_file": "/tmp/whereabouts.log", "log_level": "debug", "type": "whereabouts", "range": "192.168.20.0/24" } }, { "type": "rdma" } ] }
    # for i in $(seq 2 5); do kubectl get node node$i -o json | jq '.status.allocatable."nvidia.com/sriov_rdma"'; done
    "8"
    "8"
    "8"
    "8"
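The comparison between the reported values and the configured number of VFs can also be scripted. The following is a minimal sketch, assuming the values arrive one per line in the quoted form that jq prints above; the function name vfs_match is hypothetical. It fails if any value differs from the expected count or if no value is received at all.

```shell
#!/usr/bin/env bash
# Sketch: check that every allocatable value read from stdin equals the expected VF count.
# Assumption: one value per line, optionally wrapped in double quotes as printed by jq.

vfs_match() {
    local expected="$1" value rc=0 seen=0
    while IFS= read -r value; do
        value="${value//\"/}"            # strip the quotes jq prints around strings
        seen=$((seen + 1))
        if [ "$value" != "$expected" ]; then
            echo "unexpected: $value"
            rc=1
        fi
    done
    [ "$seen" -gt 0 ] || rc=1            # no input at all is also a failure
    return "$rc"
}

# Example invocation (not executed here):
#   for i in $(seq 2 5); do
#       kubectl get node node$i -o json | jq '.status.allocatable."nvidia.com/sriov_rdma"'
#   done | vfs_match 8
```

A node that does not expose the resource yields null from jq, which the function reports as unexpected.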

The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision the GPU. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for the GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and others. For information on platform support and getting started, visit the official documentation repository.

Install Helm on the K8S master node (done previously).

Add the NVIDIA GPU Operator Helm repository (same as with Network Operator):

Master Node Console

    # helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    # helm repo update

Verify that NFD is running on the cluster (enabled through NVIDIA Network Operator). The output should be true for all the nodes:

Master Node Console

    # kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
    true
    true
    true
    true
    true

Deploy the GPU Operator, enable GPUDirect RDMA, and disable the NFD plugin as it is already running in the cluster:

Master Node Console

    # helm install --wait gpu-operator -n nvidia-gpu-operator --create-namespace nvidia/gpu-operator --set nfd.enabled=false --set driver.rdma.enabled=true
    NAME: gpu-operator
    LAST DEPLOYED: Wed Jun 19 10:40:35 2024
    NAMESPACE: nvidia-gpu-operator
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None

Wait for all required pods to be spawned:

Master Node Console

    # kubectl get pods -n nvidia-gpu-operator
    NAME                                       READY   STATUS      RESTARTS   AGE
    gpu-feature-discovery-2mx2x                1/1     Running     0          11m
    gpu-feature-discovery-gz5lm                1/1     Running     0          7m23s
    gpu-feature-discovery-vxfvp                1/1     Running     0          14m
    gpu-feature-discovery-wfhhl                1/1     Running     0          4m19s
    gpu-operator-7bbf8bb6b7-6mnrl              1/1     Running     0          20d
    nvidia-container-toolkit-daemonset-cg4h6   1/1     Running     0          11m
    nvidia-container-toolkit-daemonset-d9xr5   1/1     Running     0          7m23s
    nvidia-container-toolkit-daemonset-fqx7n   1/1     Running     0          14m
    nvidia-container-toolkit-daemonset-qj2rg   1/1     Running     0          4m19s
    nvidia-cuda-validator-8nmqs                0/1     Completed   0          5m51s
    nvidia-cuda-validator-dk9q2                0/1     Completed   0          13m
    nvidia-cuda-validator-mtmn8                0/1     Completed   0          2m44s
    nvidia-cuda-validator-zb9lc                0/1     Completed   0          9m45s
    nvidia-dcgm-exporter-227m9                 1/1     Running     0          11m
    nvidia-dcgm-exporter-7lptj                 1/1     Running     0          7m23s
    nvidia-dcgm-exporter-7pfvv                 1/1     Running     0          4m19s
    nvidia-dcgm-exporter-cmg9x                 1/1     Running     0          14m
    nvidia-device-plugin-daemonset-njjc7       1/1     Running     0          14m
    nvidia-device-plugin-daemonset-nnqgs       1/1     Running     0          11m
    nvidia-device-plugin-daemonset-p2hqd       1/1     Running     0          4m19s
    nvidia-device-plugin-daemonset-zqmbh       1/1     Running     0          7m23s
    nvidia-driver-daemonset-2vc5m              2/2     Running     0          8m11s
    nvidia-driver-daemonset-gst7x              2/2     Running     0          15m
    nvidia-driver-daemonset-hpw6m              2/2     Running     0          12m
    nvidia-driver-daemonset-xbm7n              2/2     Running     0          5m4s
    nvidia-mig-manager-5nph5                   1/1     Running     0          7m23s
    nvidia-mig-manager-84txd                   1/1     Running     0          14m
    nvidia-mig-manager-clfzv                   1/1     Running     0          4m19s
    nvidia-mig-manager-npl2x                   1/1     Running     0          11m
    nvidia-operator-validator-4h5rc            1/1     Running     0          11m
    nvidia-operator-validator-8krdh            1/1     Running     0          4m19s
    nvidia-operator-validator-8m7nk            1/1     Running     0          14m
    nvidia-operator-validator-g9qwj            1/1     Running     0          7m23s

Verify that the allocatable resources now include nvidia.com/gpu, and that the NVIDIA kernel modules are loaded successfully on the worker nodes (in addition to the regular kernel modules, the nvidia-peermem kernel module must be loaded to enable GDR):

Master Node Console

    # for i in $(seq 2 5); do kubectl get node node$i -o json | jq '.status.allocatable."nvidia.com/gpu"'; done
    "2"
    "2"
    "2"
    "2"