First you will need to set up the repository.
Update the apt package index with the command below:
$ sudo apt-get update
Install packages to allow apt to use a repository over HTTPS:
$ sudo apt-get install -y \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common
Next you will need to add Docker’s official GPG key with the command below:
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
Verify that you now have the key with the fingerprint 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88, by searching for the last 8 characters of the fingerprint:
$ sudo apt-key fingerprint 0EBFCD88
pub rsa4096 2017-02-22 [SCEA]
9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
uid [ unknown] Docker Release (CE deb) <docker@docker.com>
sub rsa4096 2017-02-22 [S]
Use the following command to set up the stable repository:
$ sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
Install Docker Engine - Community. First, update the apt package index:
$ sudo apt-get update
Install Docker Engine:
$ sudo apt-get install -y docker-ce=5:19.03.12~3-0~ubuntu-bionic docker-ce-cli=5:19.03.12~3-0~ubuntu-bionic containerd.io
Verify that Docker Engine - Community is installed correctly by running the hello-world image:
$ sudo docker run hello-world
More information on how to install Docker can be found in the official Docker documentation at https://docs.docker.com/engine/install/ubuntu/.
Make sure Docker has been started and enabled before beginning installation:
$ sudo systemctl start docker && sudo systemctl enable docker
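Optionally, you can confirm that the Docker service is active before moving on, for example:
$ sudo systemctl is-active docker
$ sudo docker version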
Execute the following to add apt keys:
$ sudo apt-get update && sudo apt-get install -y apt-transport-https curl
$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
$ sudo mkdir -p /etc/apt/sources.list.d/
Create kubernetes.list:
$ cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
Now execute the below to install kubelet, kubeadm and kubectl:
$ sudo apt-get update
$ sudo apt-get install -y -q kubelet=1.21.1-00 kubectl=1.21.1-00 kubeadm=1.21.1-00
$ sudo apt-mark hold kubelet kubeadm kubectl
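As an optional sanity check, you can confirm that the pinned client versions were installed, for example:
$ kubeadm version -o short
$ kubectl version --client --short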
Reload the system daemon:
$ sudo systemctl daemon-reload
Disable swap
$ sudo swapoff -a
$ sudo nano /etc/fstab
Add a # at the start of any line that references swap (for example, lines beginning with /swap). The # comments the line out, and the result should look something like this:
UUID=e879fda9-4306-4b5b-8512-bba726093f1d / ext4 defaults 0 0
UUID=DCD4-535C /boot/efi vfat defaults 0 0
#/swap.img none swap sw 0 0
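Alternatively, instead of editing the file by hand, a one-line sed command such as the following comments out any swap entries. This is offered only as a convenience and assumes whitespace-separated fstab fields:
$ sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab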
Execute the following command:
$ sudo kubeadm init --pod-network-cidr=192.168.0.0/16
Output:
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join <your-host-IP>:6443 --token 489oi5.sm34l9uh7dk4z6cm \
--discovery-token-ca-cert-hash sha256:17165b6c4a4b95d73a3a2a83749a957a10161ae34d2dfd02cd730597579b4b34
Following the instructions in the output, execute the commands as shown below:
$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
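At this point kubectl should be able to reach the API server; a quick check, for example:
$ kubectl cluster-info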
With the following command, you install a pod-network add-on to the control plane node. We are using calico as the pod-network add-on here:
$ kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
You can execute the below command to ensure that all pods are up and running:
$ kubectl get pods --all-namespaces
Output:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-65b8787765-bjc8h 1/1 Running 0 2m8s
kube-system calico-node-c2tmk 1/1 Running 0 2m8s
kube-system coredns-5c98db65d4-d4kgh 1/1 Running 0 9m8s
kube-system coredns-5c98db65d4-h6x8m 1/1 Running 0 9m8s
kube-system etcd-#yourhost 1/1 Running 0 8m25s
kube-system kube-apiserver-#yourhost 1/1 Running 0 8m7s
kube-system kube-controller-manager-#yourhost 1/1 Running 0 8m3s
kube-system kube-proxy-6sh42 1/1 Running 0 9m7s
kube-system kube-scheduler-#yourhost 1/1 Running 0 8m26s
The get nodes command shows that the control-plane node is up and ready:
$ kubectl get nodes
Output:
NAME STATUS ROLES AGE VERSION
#yourhost Ready control-plane,master 10m v1.21.1
Since we are using a single-node Kubernetes cluster, the cluster will not schedule pods on the control plane node by default. To schedule pods on the control plane node, we have to remove the taint by executing the following command:
$ kubectl taint nodes --all node-role.kubernetes.io/master-
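You can verify that the taint has been removed; the output should show Taints: <none> for the control-plane node, for example:
$ kubectl describe nodes | grep -i taints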
Refer to https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/ for more information.
Execute the following command to download and install Helm 3.5.4:
$ wget https://get.helm.sh/helm-v3.5.4-linux-amd64.tar.gz
$ tar -zxvf helm-v3.5.4-linux-amd64.tar.gz
$ sudo mv linux-amd64/helm /usr/local/bin/helm
$ rm -rf helm-v3.5.4-linux-amd64.tar.gz linux-amd64/
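To confirm the installation, you can check the Helm client version, for example:
$ helm version --short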
Refer to https://github.com/helm/helm/releases and https://helm.sh/docs/using_helm/#installing-helm for more information.
Prerequisites
If Mellanox NICs are not connected to your nodes, please skip this step and proceed to the next step, Installing GPU Operator.
The instructions below assume that Mellanox NICs are connected to your machines.
Execute the below command to verify Mellanox NICs are enabled on your machines:
$ lspci | grep -i "Mellanox"
Output:
0c:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
0c:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
Execute the below command to determine which Mellanox device is active. Use the device that shows Link detected: yes in the further steps. The command below works only if the NICs were added before the operating system was installed:
$ for device in `sudo lshw -class network -short | grep -i ConnectX | awk '{print $2}' | egrep -v 'Device|path' | sed '/^$/d'`;do echo -n $device; sudo ethtool $device | grep -i "Link detected"; done
Output:
ens160f0 Link detected: yes
ens160f1 Link detected: no
Create the custom network operator values.yaml:
$ nano network-operator-values.yaml
Update the devices field with the active Mellanox device identified by the command above:
deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      devices: [ens160f0]
For more information about the custom network operator values.yaml, please refer to the Network Operator documentation.
Helm is required to install the Network Operator. Add the NVIDIA Network Operator Helm repo:
$ helm repo add mellanox https://mellanox.github.io/network-operator
Update the Helm repo:
$ helm repo update
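Optionally, verify that the Network Operator chart is now visible in the repo, for example:
$ helm search repo mellanox/network-operator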
Install NVIDIA Network Operator
Execute the commands below:
$ kubectl label nodes --all node-role.kubernetes.io/master- --overwrite
$ helm install -f ./network-operator-values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator
Validating the State of Network Operator
Please note that the installation of the Network Operator can take a couple of minutes. How long the installation will take depends on your internet speed.
$ kubectl get pods --all-namespaces | egrep 'network-operator|nvidia-network-operator-resources'
NAMESPACE NAME READY STATUS RESTARTS AGE
network-operator network-operator-547cb8d999-mn2h9 1/1 Running 0 17m
network-operator network-operator-node-feature-discovery-master-596fb8b7cb-qrmvv 1/1 Running 0 17m
network-operator network-operator-node-feature-discovery-worker-qt5xt 1/1 Running 0 17m
nvidia-network-operator-resources cni-plugins-ds-dl5vl 1/1 Running 0 17m
nvidia-network-operator-resources kube-multus-ds-w82rv 1/1 Running 0 17m
nvidia-network-operator-resources mofed-ubuntu20.04-ds-xfpzl 1/1 Running 0 17m
nvidia-network-operator-resources rdma-shared-dp-ds-2hgb6 1/1 Running 0 17m
nvidia-network-operator-resources sriov-device-plugin-ch7bz 1/1 Running 0 10m
nvidia-network-operator-resources whereabouts-56ngr 1/1 Running 0 10m
Please refer to the Network Operator page for more information.
NVIDIA AI Enterprise customers have access to a pre-configured GPU Operator within the NVIDIA Enterprise Catalog. The GPU Operator is pre-configured to simplify the provisioning experience with NVIDIA AI Enterprise deployments.
The pre-configured GPU Operator differs from the GPU Operator in the public NGC catalog. The differences are:
It is configured to use a prebuilt vGPU driver image (Only available to NVIDIA AI Enterprise customers).
It is configured to use the NVIDIA License System (NLS).
Install GPU Operator
The GPU Operator with NVIDIA AI Enterprise requires some tasks to be completed prior to installation. Refer to the document NVIDIA AI Enterprise for instructions prior to running the below commands.
NVIDIA GPU Operator Install scripts are also available here.
Add the NVIDIA AI Enterprise Helm repository, where api-key is the NGC API key for accessing the NVIDIA Enterprise Collection that you generated:
$ helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=api-key && helm repo update
$ helm install --wait --generate-name nvaie/gpu-operator -n gpu-operator
License GPU Operator
Copy the NLS license token into a file named client_configuration_token.tok. Then create an empty gridd.conf file:
$ touch gridd.conf
Create a ConfigMap for NLS licensing:
$ kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok
Create a Kubernetes Secret to access the NGC registry:
$ kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nvaie" --docker-username='$oauthtoken' --docker-password='<YOUR API KEY>' --docker-email='<YOUR EMAIL>' -n gpu-operator
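You can confirm that the licensing ConfigMap and the registry Secret exist in the gpu-operator namespace, for example:
$ kubectl get configmap licensing-config -n gpu-operator
$ kubectl get secret ngc-secret -n gpu-operator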
GPU Operator with RDMA (Optional)
Prerequisites
Please install the Network Operator to ensure that the MOFED drivers are installed.
After NVIDIA Network Operator installation is completed, execute the below command to install the GPU Operator to load nv_peer_mem modules.
$ helm install --wait gpu-operator nvaie/gpu-operator -n gpu-operator --set driver.rdma.enabled=true
Validating the Network Operator with GPUDirect RDMA
Execute the below command to list the Mellanox NICs with their status:
$ kubectl exec -it $(kubectl get pods -n nvidia-network-operator-resources | grep mofed | awk '{print $1}') -n nvidia-network-operator-resources -- ibdev2netdev
Output:
mlx5_0 port 1 ==> ens192f0 (Up)
mlx5_1 port 1 ==> ens192f1 (Down)
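You can also check directly on the host that the nv_peer_mem module mentioned above has been loaded. This is an optional check, and the module name can vary with driver versions:
$ lsmod | grep nv_peer_mem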
Create the networkdefinition.yaml:
$ nano networkdefinition.yaml
Create a network definition for IPAM, replacing ens192f0 in the master field with the active Mellanox device:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_device_a
  name: rdma-net-ipam
  namespace: default
spec:
  config: |-
    {
        "cniVersion": "0.3.1",
        "name": "rdma-net-ipam",
        "plugins": [
            {
                "ipam": {
                    "datastore": "kubernetes",
                    "kubernetes": {
                        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
                    },
                    "log_file": "/tmp/whereabouts.log",
                    "log_level": "debug",
                    "range": "192.168.111.0/24",
                    "type": "whereabouts"
                },
                "type": "macvlan",
                "master": "ens192f0",
                "vlan": 111
            },
            {
                "mtu": 1500,
                "type": "tuning"
            }
        ]
    }
If you do not have VLAN-based networking on the high-performance side, please set "vlan": 0.
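Once the file is saved, apply it and confirm that the network attachment definition was created (this assumes you saved it as networkdefinition.yaml as above):
$ kubectl apply -f networkdefinition.yaml
$ kubectl get network-attachment-definitions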
Validate the state of the GPU Operator
Please note that the installation of the GPU Operator can take a couple of minutes. How long the installation will take depends on your internet speed.
$ kubectl get pods --all-namespaces | grep -v kube-system
Results:
NAMESPACE NAME READY STATUS RESTARTS AGE
default gpu-operator-1622656274-node-feature-discovery-master-5cddq96gq 1/1 Running 0 2m39s
default gpu-operator-1622656274-node-feature-discovery-worker-wr88v 1/1 Running 0 2m39s
default gpu-operator-7db468cfdf-mdrdp 1/1 Running 0 2m39s
gpu-operator-resources gpu-feature-discovery-g425f 1/1 Running 0 2m20s
gpu-operator-resources nvidia-container-toolkit-daemonset-mcmxj 1/1 Running 0 2m20s
gpu-operator-resources nvidia-cuda-validator-s6x2p 0/1 Completed 0 48s
gpu-operator-resources nvidia-dcgm-exporter-wtxnx 1/1 Running 0 2m20s
gpu-operator-resources nvidia-dcgm-jbz94 1/1 Running 0 2m20s
gpu-operator-resources nvidia-device-plugin-daemonset-hzzdt 1/1 Running 0 2m20s
gpu-operator-resources nvidia-device-plugin-validator-9nkxq 0/1 Completed 0 17s
gpu-operator-resources nvidia-driver-daemonset-kt8g5 1/1 Running 0 2m20s
gpu-operator-resources nvidia-operator-validator-cw4j5 1/1 Running 0 2m20s
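As an optional smoke test, you can run a short-lived pod that requests one GPU and prints nvidia-smi output. The nvidia/cuda:11.0-base image name below is an assumption; substitute any CUDA base image available to your cluster. Once the pod completes, view its output with kubectl logs gpu-smoke-test.
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ kubectl logs gpu-smoke-test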
Please refer to the GPU Operator page on NGC for more information.