Running with Kubernetes

NVIDIA AI Enterprise 2.0 or later

First you will need to set up the repository.

Update the apt package index with the command below:

sudo apt-get update

Install packages to allow apt to use a repository over HTTPS:

sudo apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

Next you will need to add Docker’s official GPG key with the command below:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Verify that you now have the key with the fingerprint 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88, by searching for the last 8 characters of the fingerprint:

sudo apt-key fingerprint 0EBFCD88

pub   rsa4096 2017-02-22 [SCEA]
      9DC8 5822 9FC7 DD38 854A  E2D8 8D81 803C 0EBF CD88
uid           [ unknown] Docker Release (CE deb) <docker@docker.com>
sub   rsa4096 2017-02-22 [S]
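The visual fingerprint comparison above can also be scripted. The following is a minimal sketch, not part of the original procedure: the helper name has_fingerprint is hypothetical, and it simply strips spaces and newlines from the key listing before searching for the expected fingerprint as a fixed string.

```shell
# has_fingerprint EXPECTED
# Hypothetical helper: reads a key listing on stdin and succeeds only if it
# contains EXPECTED, ignoring spaces and line breaks on both sides.
has_fingerprint() {
  expected=$(printf '%s' "$1" | tr -d ' ')
  tr -d ' \n' | grep -qF "$expected"
}

# Usage (assumes the Docker key was added as described above):
# sudo apt-key fingerprint 0EBFCD88 \
#   | has_fingerprint "9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88" \
#   && echo "fingerprint matches"
```

Pass the full fingerprint rather than a short suffix, since a fixed-string search for a short value could match unrelated text in the listing.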

Use the following command to set up the stable repository:

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

Install Docker Engine - Community

Update the apt package index:

sudo apt-get update

Install Docker Engine:

Refer to Install Docker Engine on Ubuntu in the Docker documentation for the current installation procedure for Ubuntu.

Verify that Docker Engine - Community is installed correctly by running the hello-world image:

sudo docker run hello-world

More information on how to install Docker can be found in the Installing Docker section.

Make sure Docker has been started and enabled before beginning installation:

sudo systemctl start docker && sudo systemctl enable docker

Execute the following to add apt keys:

sudo apt-get update && sudo apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo mkdir -p /etc/apt/sources.list.d/

Create kubernetes.list:

cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF

Install kubelet, kubeadm, and kubectl, then hold them at the installed version:

sudo apt-get update
sudo apt-get install -y -q kubelet=1.21.1-00 kubectl=1.21.1-00 kubeadm=1.21.1-00
sudo apt-mark hold kubelet kubeadm kubectl

Reload the system daemon:

sudo systemctl daemon-reload

Disable swap:

sudo swapoff -a
sudo nano /etc/fstab

Note

Add a # before every line that starts with /swap. The # turns the line into a comment, so the result should look something like this:

UUID=e879fda9-4306-4b5b-8512-bba726093f1d / ext4 defaults 0 0
UUID=DCD4-535C /boot/efi vfat defaults 0 0
#/swap.img none swap sw 0 0
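The manual edit above can also be done non-interactively. The following is a hedged sketch rather than the documented procedure: the helper name comment_out_swap is hypothetical, and it treats any uncommented line whose third field is swap as a swap entry, which is slightly broader than "starts with /swap". It keeps a .bak backup next to the file.

```shell
# comment_out_swap FILE
# Hypothetical helper: prefixes every uncommented swap entry in FILE with '#'.
# A backup of the original is written to FILE.bak.
comment_out_swap() {
  # Match lines that do not start with '#' or whitespace and whose third
  # whitespace-separated field is exactly "swap"; prepend '#' to them.
  sed -E -i.bak \
    's/^([^#[:space:]][^[:space:]]*[[:space:]]+[^[:space:]]+[[:space:]]+swap[[:space:]].*)$/#\1/' \
    "$1"
}

# Try it on a scratch copy first, not on the real /etc/fstab:
# cp /etc/fstab /tmp/fstab.test && comment_out_swap /tmp/fstab.test
```

Lines that are already commented are left unchanged, so running the helper twice does not add a second #.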

Initialize the Kubernetes control plane with the following command:

sudo kubeadm init --pod-network-cidr=192.168.0.0/16

Output:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join <your-host-IP>:6443 --token 489oi5.sm34l9uh7dk4z6cm \
        --discovery-token-ca-cert-hash sha256:17165b6c4a4b95d73a3a2a83749a957a10161ae34d2dfd02cd730597579b4b34

Following the instructions in the output, execute the commands as shown below:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Install a pod-network add-on on the control plane node with the following command. Calico is used as the pod-network add-on here:

kubectl apply -f https://docs.projectcalico.org/v3.14/manifests/calico.yaml

Run the command below to verify that all pods are up and running:

kubectl get pods --all-namespaces

Output:

NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-65b8787765-bjc8h   1/1     Running   0          2m8s
kube-system   calico-node-c2tmk                          1/1     Running   0          2m8s
kube-system   coredns-5c98db65d4-d4kgh                   1/1     Running   0          9m8s
kube-system   coredns-5c98db65d4-h6x8m                   1/1     Running   0          9m8s
kube-system   etcd-#yourhost                             1/1     Running   0          8m25s
kube-system   kube-apiserver-#yourhost                   1/1     Running   0          8m7s
kube-system   kube-controller-manager-#yourhost          1/1     Running   0          8m3s
kube-system   kube-proxy-6sh42                           1/1     Running   0          9m7s
kube-system   kube-scheduler-#yourhost                   1/1     Running   0          8m26s
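Rather than reading the listing by eye, the STATUS column can be checked with a small filter. This is a minimal sketch under stated assumptions: the helper name all_pods_ready is hypothetical, it expects the column layout shown above (NAMESPACE, NAME, READY, STATUS, ...), and it looks only at STATUS, not at the READY counts.

```shell
# all_pods_ready
# Hypothetical helper: reads `kubectl get pods --all-namespaces` output on
# stdin, prints every pod whose STATUS is neither Running nor Completed,
# and returns non-zero if it found any.
all_pods_ready() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" {
         print $1 "/" $2 ": " $4
         bad = 1
       }
       END { exit (bad ? 1 : 0) }'
}

# Usage:
# kubectl get pods --all-namespaces | all_pods_ready && echo "all pods ready"
```

Completed is accepted because one-shot validator pods later in this guide finish in that state.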

The get nodes command shows that the control-plane node is up and ready:

kubectl get nodes

Output:

NAME        STATUS   ROLES                  AGE   VERSION
#yourhost   Ready    control-plane,master   10m   v1.21.1

By default, the cluster does not schedule pods on the control plane node. Because this is a single-node Kubernetes cluster, remove the taint with the following command so that pods can be scheduled on the control plane node:

kubectl taint nodes --all node-role.kubernetes.io/master-

Refer to https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ for more information.

Execute the following command to download and install Helm 3.5.4:

wget https://get.helm.sh/helm-v3.5.4-linux-amd64.tar.gz
tar -zxvf helm-v3.5.4-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
rm -rf helm-v3.5.4-linux-amd64.tar.gz linux-amd64/

Refer to https://github.com/helm/helm/releases and https://helm.sh/docs/using_helm/#installing-helm for more information.

NVIDIA AI Enterprise 2.0 or later

Prerequisites

Note

If Mellanox NICs are not connected to your nodes, please skip this step and proceed to NVIDIA GPU Operator.

The below instructions assume that Mellanox NICs are connected to your machines.

Execute the below command to verify Mellanox NICs are enabled on your machines:

lspci | grep -i "Mellanox"

Output:

0c:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
0c:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]

Execute the below command to determine which Mellanox device is active:

Note

In further steps, use whichever device shows Link detected: yes. The command below works only if the NICs were added before the operating system was installed.

for device in `sudo lshw -class network -short | grep -i ConnectX | awk '{print $2}' | egrep -v 'Device|path' | sed '/^$/d'`;do echo -n $device; sudo ethtool $device | grep -i "Link detected"; done

Output:

ens160f0        Link detected: yes
ens160f1        Link detected: no
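To pick out the active device without reading the listing, the output can be filtered. A minimal sketch, assuming the output format shown above (device name first, then the Link detected field); the helper name active_devices is hypothetical.

```shell
# active_devices
# Hypothetical helper: reads lines such as "ens160f0  Link detected: yes"
# on stdin and prints only the device names whose link is detected.
active_devices() {
  awk '/Link detected: yes/ { print $1 }'
}

# Usage: pipe the output of the loop above into the filter, for example
# <loop from above> | active_devices
```

If more than one device is printed, choose the one you intend to use for the high-performance network.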

Create the custom network operator values.yaml.

nano network-operator-values.yaml

Add the following content, replacing ens160f0 with the active Mellanox device reported by the command above:

deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      devices: [ens160f0]
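As an alternative to editing the file in nano, the values file can be generated for a given device. This is a sketch under assumptions: the helper name write_network_operator_values is hypothetical, and the YAML nesting is reconstructed from the keys listed above, so compare it against the Network Operator documentation before using it.

```shell
# write_network_operator_values DEVICE [FILE]
# Hypothetical helper: writes the values shown above to FILE
# (default: network-operator-values.yaml), using DEVICE as the active NIC.
write_network_operator_values() {
  device="$1"
  file="${2:-network-operator-values.yaml}"
  # Unquoted heredoc delimiter so that $device is expanded.
  cat > "$file" <<EOF
deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      devices: [$device]
EOF
}

# Usage: write_network_operator_values ens160f0
```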

For more information about the custom network operator values.yaml, refer to Network Operator.

Add the NVIDIA repo:

Note

Installing Helm is required to install GPU Operator.

helm repo add mellanox https://mellanox.github.io/network-operator

Update the Helm repo:

helm repo update

Install NVIDIA Network Operator

Execute the commands below:

kubectl label nodes --all node-role.kubernetes.io/master- --overwrite
helm install -f ./network-operator-values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator

Validating the State of Network Operator

Installing the Network Operator can take a couple of minutes, depending on your internet speed.

kubectl get pods --all-namespaces | egrep 'network-operator|nvidia-network-operator-resources'

Output:

NAMESPACE                           NAME                                                              READY   STATUS    RESTARTS   AGE
network-operator                    network-operator-547cb8d999-mn2h9                                 1/1     Running   0          17m
network-operator                    network-operator-node-feature-discovery-master-596fb8b7cb-qrmvv   1/1     Running   0          17m
network-operator                    network-operator-node-feature-discovery-worker-qt5xt              1/1     Running   0          17m
nvidia-network-operator-resources   cni-plugins-ds-dl5vl                                              1/1     Running   0          17m
nvidia-network-operator-resources   kube-multus-ds-w82rv                                              1/1     Running   0          17m
nvidia-network-operator-resources   mofed-ubuntu20.04-ds-xfpzl                                        1/1     Running   0          17m
nvidia-network-operator-resources   rdma-shared-dp-ds-2hgb6                                           1/1     Running   0          17m
nvidia-network-operator-resources   sriov-device-plugin-ch7bz                                         1/1     Running   0          10m
nvidia-network-operator-resources   whereabouts-56ngr                                                 1/1     Running   0          10m

Please refer to the Network Operator page for more information.

NVIDIA AI Enterprise 2.0 or later

NVIDIA AI Enterprise customers have access to a pre-configured GPU Operator within the NVIDIA Enterprise Catalog. The GPU Operator is pre-configured to simplify the provisioning experience with NVIDIA AI Enterprise deployments.

The pre-configured GPU Operator differs from the GPU Operator in the public NGC catalog. The differences are:

  • It is configured to use a prebuilt vGPU driver image (Only available to NVIDIA AI Enterprise customers).

  • It is configured to use the NVIDIA License System (NLS).

Install GPU Operator

Note

The GPU Operator with NVIDIA AI Enterprise requires some tasks to be completed prior to installation. Refer to the document NVIDIA AI Enterprise for instructions prior to running the below commands.

License GPU Operator for CLS

Add the NVIDIA AI Enterprise Helm repository, where api-key is the NGC API key for accessing the NVIDIA Enterprise Collection that you generated.

helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=api-key && helm repo update

  1. Copy the NLS license token into the file named client_configuration_token.tok.

  2. Create an empty gridd.conf file using the command below.

    touch gridd.conf


  3. Create Configmap for the NLS Licensing using the command below.

    kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok


  4. Create K8s Secret to Access NGC registry.

    kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nvaie" --docker-username='$oauthtoken' --docker-password='<YOUR API KEY>' --docker-email=’


  5. Install the GPU Operator with the command below.

    helm install --wait --generate-name nvaie/gpu-operator -n gpu-operator


License GPU Operator for DLS

Add the NVIDIA GPU Operator Helm repository:

helm repo add nvidia https://nvidia.github.io/gpu-operator \
   && helm repo update


Prior to GPU Operator v1.9, the operator was installed in the default namespace while all operands were installed in the gpu-operator-resources namespace.

Starting with GPU Operator v1.9, both the operator and operands are installed in the same namespace. The namespace is configurable and is determined during installation. For example, to install the GPU Operator in the gpu-operator namespace:

helm install --wait --generate-name \
     -n gpu-operator --create-namespace nvidia/gpu-operator


If a namespace is not specified during installation, all GPU Operator components will be installed in the default namespace.

GPU Operator with RDMA (Optional)

Prerequisites

After the NVIDIA Network Operator installation completes, execute the below command to install the GPU Operator and load the nv_peer_mem modules.

helm install --wait gpu-operator nvaie/gpu-operator -n gpu-operator --set driver.rdma.enabled=true

Validating the Network Operator with GPUDirect RDMA

Execute the below command to list the Mellanox NICs with their status:

kubectl exec -it $(kubectl get pods -n nvidia-network-operator-resources | grep mofed | awk '{print $1}') -n nvidia-network-operator-resources -- ibdev2netdev

Output:

mlx5_0 port 1 ==> ens192f0 (Up)
mlx5_1 port 1 ==> ens192f1 (Down)
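The interface to use in the next step is the one reported as (Up). A minimal sketch for extracting it, assuming the ibdev2netdev line format shown above; the helper name up_netdevs is hypothetical.

```shell
# up_netdevs
# Hypothetical helper: reads ibdev2netdev output on stdin, for example
#   mlx5_0 port 1 ==> ens192f0 (Up)
# and prints the network interface name of every port that is Up.
up_netdevs() {
  awk '$6 == "(Up)" { print $5 }'
}

# Usage: pipe the output of the kubectl exec command above into up_netdevs.
```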

Edit the networkdefinition.yaml.

nano networkdefinition.yaml

Create the network definition for IPAM, replacing ens192f0 in the master field with the active Mellanox device:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_device_a
  name: rdma-net-ipam
  namespace: default
spec:
  config: |-
    {
        "cniVersion": "0.3.1",
        "name": "rdma-net-ipam",
        "plugins": [
            {
                "ipam": {
                    "datastore": "kubernetes",
                    "kubernetes": {
                        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
                    },
                    "log_file": "/tmp/whereabouts.log",
                    "log_level": "debug",
                    "range": "192.168.111.0/24",
                    "type": "whereabouts"
                },
                "type": "macvlan",
                "master": "ens192f0",
                "vlan": 111
            },
            {
                "mtu": 1500,
                "type": "tuning"
            }
        ]
    }

Note

If you do not have VLAN-based networking on the high-performance side, set "vlan": 0.

Validate the state of the GPU Operator

Installing the GPU Operator can take a couple of minutes, depending on your internet speed.

kubectl get pods --all-namespaces | grep -v kube-system

Results:

NAMESPACE                NAME                                                              READY   STATUS      RESTARTS   AGE
default                  gpu-operator-1622656274-node-feature-discovery-master-5cddq96gq   1/1     Running     0          2m39s
default                  gpu-operator-1622656274-node-feature-discovery-worker-wr88v       1/1     Running     0          2m39s
default                  gpu-operator-7db468cfdf-mdrdp                                     1/1     Running     0          2m39s
gpu-operator-resources   gpu-feature-discovery-g425f                                       1/1     Running     0          2m20s
gpu-operator-resources   nvidia-container-toolkit-daemonset-mcmxj                          1/1     Running     0          2m20s
gpu-operator-resources   nvidia-cuda-validator-s6x2p                                       0/1     Completed   0          48s
gpu-operator-resources   nvidia-dcgm-exporter-wtxnx                                        1/1     Running     0          2m20s
gpu-operator-resources   nvidia-dcgm-jbz94                                                 1/1     Running     0          2m20s
gpu-operator-resources   nvidia-device-plugin-daemonset-hzzdt                              1/1     Running     0          2m20s
gpu-operator-resources   nvidia-device-plugin-validator-9nkxq                              0/1     Completed   0          17s
gpu-operator-resources   nvidia-driver-daemonset-kt8g5                                     1/1     Running     0          2m20s
gpu-operator-resources   nvidia-operator-validator-cw4j5                                   1/1     Running     0          2m20s

Please refer to the GPU Operator page on NGC for more information.

© Copyright 2022-2023, NVIDIA. Last updated on Nov 7, 2023.