If Mellanox NICs are not connected to your nodes, skip this step and proceed to the next step, Installing GPU Operator.
The below instructions assume that Mellanox NICs are connected to your machines.
Execute the below command to verify Mellanox NICs are enabled on your machines:
$ lspci | grep -i "Mellanox"
Output:
0c:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
0c:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
Execute the below command to find out which Mellanox device is active. In further steps, use the device that shows Link detected: yes. This command works only if the NICs were added before the operating system was installed.
$ for device in $(sudo lshw -class network -short | grep -i ConnectX | awk '{print $2}' | grep -Ev 'Device|path' | sed '/^$/d'); do echo -n "$device"; sudo ethtool "$device" | grep -i "Link detected"; done
Output:
ens160f0 Link detected: yes
ens160f1 Link detected: no
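When the later steps are scripted, the active device can be picked out of this output automatically. The following is a minimal sketch and not part of the original instructions: the function name active_mellanox_devices is hypothetical, and it assumes input lines in the format shown above (a device name followed by Link detected: yes or no).

```shell
# Hypothetical helper (not part of any NVIDIA or Mellanox tooling).
# Reads "<device> Link detected: yes|no" lines on stdin and prints
# the name of every device whose link is reported as up.
active_mellanox_devices() {
  awk 'tolower($0) ~ /link detected: yes/ { print $1 }'
}
```

For example, appending | active_mellanox_devices after the final done of the loop above would print only ens160f0 for the sample output.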
Create the custom network operator values.yaml file:
$ nano network-operator-values.yaml
Add the following content to the file, setting the devices field to the active Mellanox device reported by the above command:
deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      devices: [ens160f0]
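Instead of editing the file by hand, the same content could be generated from the device name. This is only a sketch under stated assumptions: the function name write_network_operator_values is hypothetical, and the YAML nesting (resources under rdmaSharedDevicePlugin, with vendors and devices belonging to the list entry) is assumed from the keys listed above.

```shell
# Hypothetical helper: writes a network operator values file for one device.
#   $1 = interface name, for example ens160f0
#   $2 = output path (optional, defaults to network-operator-values.yaml)
# The YAML nesting below is an assumption based on the keys shown above.
write_network_operator_values() {
  device="$1"
  outfile="${2:-network-operator-values.yaml}"
  cat > "$outfile" <<EOF
deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      devices: [$device]
EOF
}
```

For example, write_network_operator_values ens160f0 would create network-operator-values.yaml in the current directory with ens160f0 as the device.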
For more information about the custom network operator values.yaml, please refer to Network Operator.
Helm is required to install the Network Operator. Add the Mellanox Helm repo:
$ helm repo add mellanox https://mellanox.github.io/network-operator
Update the Helm repo:
$ helm repo update
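Execute the commands below to remove the node-role.kubernetes.io/master label from all nodes and install the Network Operator with the custom values file: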
$ kubectl label nodes --all node-role.kubernetes.io/master- --overwrite
$ helm install -f ./network-operator-values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator
Validating the State of Network Operator
Please note that the installation of the Network Operator can take a couple of minutes, depending on your internet speed. Execute the below command to check that the Network Operator pods are running:
$ kubectl get pods --all-namespaces | grep -E 'network-operator|nvidia-network-operator-resources'
NAMESPACE                           NAME                                                              READY   STATUS    RESTARTS   AGE
network-operator                    network-operator-547cb8d999-mn2h9                                 1/1     Running   0          17m
network-operator                    network-operator-node-feature-discovery-master-596fb8b7cb-qrmvv   1/1     Running   0          17m
network-operator                    network-operator-node-feature-discovery-worker-qt5xt              1/1     Running   0          17m
nvidia-network-operator-resources   cni-plugins-ds-dl5vl                                              1/1     Running   0          17m
nvidia-network-operator-resources   kube-multus-ds-w82rv                                              1/1     Running   0          17m
nvidia-network-operator-resources   mofed-ubuntu20.04-ds-xfpzl                                        1/1     Running   0          17m
nvidia-network-operator-resources   rdma-shared-dp-ds-2hgb6                                           1/1     Running   0          17m
nvidia-network-operator-resources   sriov-device-plugin-ch7bz                                         1/1     Running   0          10m
nvidia-network-operator-resources   whereabouts-56ngr                                                 1/1     Running   0          10m
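Scanning the STATUS column by eye gets tedious on larger clusters, so the check can be automated. The following is a minimal sketch and not part of the original instructions: the function name pods_not_running is hypothetical, and it assumes the six-column layout shown above, where STATUS is the fourth column.

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces` output on stdin
# and prints "<namespace>/<pod> <status>" for every pod whose STATUS is not Running.
# Returns 0 when every listed pod is Running, 1 otherwise.
# Assumes STATUS is the fourth whitespace-separated column; a header line is skipped.
pods_not_running() {
  awk '$1 != "NAMESPACE" && $4 != "Running" { print $1 "/" $2 " " $4; bad = 1 }
       END { exit bad + 0 }'
}
```

For example, kubectl get pods --all-namespaces | grep -E 'network-operator' | pods_not_running prints nothing and returns 0 once all Network Operator pods are Running. Pods in any other state, such as an init or completed state, are listed.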
Please refer to the Network Operator page for more information.