Installing NVIDIA Network Operator

Bare Metal Deployment Guide (0.1.0)

NVIDIA AI Enterprise 2.0 or later

Note

If Mellanox NICs are not connected to your nodes, skip this step and proceed to the Installing NVIDIA GPU Operator section.

The instructions below assume that Mellanox NICs are connected to your machines.

Execute the command below to verify that the Mellanox NICs are visible on your machines:


$ lspci | grep -i "Mellanox"

Output:


0c:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
0c:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
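
For scripted setups, the same check can be reduced to a count. The following is a minimal sketch rather than part of the official procedure: the function name `count_mellanox_nics` is hypothetical, and it only counts lines mentioning Mellanox in lspci-style text read from standard input.

```shell
# Hypothetical helper: count PCI devices whose description mentions Mellanox.
# Reads lspci-style text on stdin, so it also works on saved output.
count_mellanox_nics() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps that case from looking like an error.
    grep -ci "mellanox" || true
}
```

Run as `lspci | count_mellanox_nics`. A result of 0 suggests that no Mellanox NIC is visible, in which case this step can be skipped as described in the note at the top of this section.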

Execute the command below to find out which Mellanox device is active:

Note

In the following steps, use the device that reports Link detected: yes. The command below works only if the NICs were added before the operating system was installed.


for device in $(sudo lshw -class network -short | grep -i ConnectX | awk '{print $2}' | grep -Ev 'Device|path' | sed '/^$/d'); do
    echo -n "$device"
    sudo ethtool "$device" | grep -i "Link detected"
done

Output:


ens160f0 Link detected: yes
ens160f1 Link detected: no
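
To pick the active device without reading the output by eye, the lines above can be filtered. This is only an illustrative sketch: the function name `active_devices` is hypothetical, and it assumes each input line starts with the interface name and ends with yes or no, as in the output shown.

```shell
# Hypothetical helper: print the names of devices whose link is up.
# Expects lines such as "ens160f0 Link detected: yes" on stdin.
active_devices() {
    # $1 is the interface name; $NF is the last field ("yes" or "no").
    awk '$NF == "yes" { print $1 }'
}
```

Piping the output of the loop above into `active_devices` prints one interface name per line for every device with a detected link.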

Create the custom network operator values.yaml.


$ nano network-operator-values.yaml

Add the following content, setting devices to the active Mellanox device reported by the command above.


deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      devices: [ens160f0]
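
As an alternative to editing the file by hand, the values file can be generated from the device name. The sketch below is an assumption-laden convenience, not the documented method: `write_network_operator_values` is a hypothetical function name, and it writes the same keys shown above with only the devices entry substituted.

```shell
# Hypothetical helper: write network-operator-values.yaml for one device.
#   $1 = active interface name (for example ens160f0)
#   $2 = output path (defaults to network-operator-values.yaml)
write_network_operator_values() {
    device=$1
    outfile=${2:-network-operator-values.yaml}
    if [ -z "$device" ]; then
        echo "usage: write_network_operator_values DEVICE [FILE]" >&2
        return 1
    fi
    # The heredoc delimiter is unquoted so that ${device} is expanded.
    cat > "$outfile" <<EOF
deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      devices: [${device}]
EOF
}
```

For example, `write_network_operator_values ens160f0` creates network-operator-values.yaml in the current directory. Review the generated file before installing, since the indentation of the YAML is significant.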

For more information about the custom Network Operator values.yaml, refer to the Network Operator documentation.

Add the NVIDIA Network Operator Helm repo:

Note

Helm must be installed before you can install the Network Operator.


$ helm repo add mellanox https://mellanox.github.io/network-operator

Update the Helm repo:


$ helm repo update
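
The Helm steps above, and the install that follows, depend on the helm and kubectl binaries being on the PATH. A minimal pre-flight sketch is shown here; the function name `require_cmds` is hypothetical and it only checks that the commands can be found, not that they are the right version or can reach the cluster.

```shell
# Hypothetical helper: fail early if a required command is not on the PATH.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing required command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `require_cmds helm kubectl` returns a non-zero status and names each missing tool on standard error, which makes it easy to stop a setup script before it reaches the install command.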

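Execute the commands below to remove the master role label from the nodes and install the Network Operator with the custom values file: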


$ kubectl label nodes --all node-role.kubernetes.io/master- --overwrite
$ helm install -f ./network-operator-values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator

Validating the State of Network Operator

Installing the Network Operator can take a few minutes, depending on your internet speed. Execute the command below to check the state of the Network Operator pods:


$ kubectl get pods --all-namespaces | grep -E 'network-operator|nvidia-network-operator-resources'


Output:

NAMESPACE                           NAME                                                              READY   STATUS    RESTARTS   AGE
network-operator                    network-operator-547cb8d999-mn2h9                                 1/1     Running   0          17m
network-operator                    network-operator-node-feature-discovery-master-596fb8b7cb-qrmvv   1/1     Running   0          17m
network-operator                    network-operator-node-feature-discovery-worker-qt5xt              1/1     Running   0          17m
nvidia-network-operator-resources   cni-plugins-ds-dl5vl                                              1/1     Running   0          17m
nvidia-network-operator-resources   kube-multus-ds-w82rv                                              1/1     Running   0          17m
nvidia-network-operator-resources   mofed-ubuntu20.04-ds-xfpzl                                        1/1     Running   0          17m
nvidia-network-operator-resources   rdma-shared-dp-ds-2hgb6                                           1/1     Running   0          17m
nvidia-network-operator-resources   sriov-device-plugin-ch7bz                                         1/1     Running   0          10m
nvidia-network-operator-resources   whereabouts-56ngr                                                 1/1     Running   0          10m
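
To validate the state without scanning the listing by eye, the pod list can be filtered for anything that is not fully up. The following is a sketch under stated assumptions: `pods_not_ready` is a hypothetical function name, and it assumes the column layout NAMESPACE NAME READY STATUS RESTARTS AGE produced by kubectl get pods --all-namespaces.

```shell
# Hypothetical helper: list pods that are not Running with all containers ready.
# Expects "kubectl get pods --all-namespaces" output on stdin.
pods_not_ready() {
    awk '$1 != "NAMESPACE" {
        # READY has the form ready/total, for example 1/1.
        split($3, ready, "/")
        if ($4 != "Running" || ready[1] != ready[2])
            print $1 "/" $2 " " $3 " " $4
    }'
}
```

Piping the pod listing into `pods_not_ready` prints one line per pod that still needs attention; empty output means every listed pod is Running and ready. Pods that finish by design would also be reported, since their status is not Running.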

Please refer to the Network Operator page for more information.

© Copyright 2022-2023, NVIDIA. Last updated on Jan 9, 2023.