Installing NVIDIA Network Operator

NVIDIA AI Enterprise 2.0 or later

Prerequisites

Note

If no Mellanox NICs are connected to your nodes, skip this step and proceed to the Installing NVIDIA GPU Operator section.

The instructions below assume that Mellanox NICs are connected to your machines.

Run the following command to verify that the Mellanox NICs are detected on your machines:

            $ lspci | grep -i "Mellanox"

Output:

            0c:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
            0c:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
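
The same check can be scripted. The sketch below defines a hypothetical helper, count_mellanox_nics (the name is illustrative and not part of any NVIDIA tooling), that counts the lines of lspci output mentioning Mellanox. It reads from standard input, so it can also be run against saved output.

```shell
# Hypothetical helper: count lspci lines that mention Mellanox.
# Reads lspci output on standard input.
count_mellanox_nics() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps the function's status at 0.
    grep -ci "mellanox" || true
}

# Usage on a node:
#   lspci | count_mellanox_nics
```

A result of 0 suggests that no Mellanox NIC is visible on the PCI bus, in which case this section can be skipped as described in the note above.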

Run the following command to determine which Mellanox device is active:

Note

In later steps, use the device that reports Link detected: yes. The command below works only if the NICs were added before the operating system was installed.

            for device in $(sudo lshw -class network -short | grep -i ConnectX | awk '{print $2}' | grep -Ev 'Device|path' | sed '/^$/d'); do echo -n "$device"; sudo ethtool "$device" | grep -i "Link detected"; done

Output:

            ens160f0        Link detected: yes
            ens160f1        Link detected: no
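
To pick the active device out of this output without reading it by eye, a small filter can help. The following is a minimal sketch with a hypothetical function name, active_devices. It assumes each input line starts with the interface name followed by the "Link detected" text, as in the output above.

```shell
# Hypothetical helper: print the interfaces whose link is up.
# Expects lines such as "ens160f0   Link detected: yes" on standard input.
active_devices() {
    # Match case-insensitively and print the first field (the interface name).
    awk 'tolower($0) ~ /link detected: yes/ { print $1 }'
}

# Usage: pipe the output of the loop above into the function:
#   ... | active_devices
```

Applied to the sample output above, the filter prints ens160f0.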

Create the custom Network Operator values file, network-operator-values.yaml:

            $ nano network-operator-values.yaml

In the file, set devices to the active Mellanox device reported by the command above:

            deployCR: true
            ofedDriver:
              deploy: true
            nvPeerDriver:
              deploy: true
            rdmaSharedDevicePlugin:
              deploy: true
              resources:
                - name: rdma_shared_device_a
                  vendors: [15b3]
                  devices: [ens160f0]
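
As an alternative to editing the file by hand, the file can be generated from the device name. The sketch below uses a hypothetical function, write_network_operator_values, which is not part of the Network Operator. It assumes that each deploy key nests under its parent and that resources nests under rdmaSharedDevicePlugin, so check the result against the Network Operator documentation before installing.

```shell
# Hypothetical helper: write the values file for a given network device.
#   $1 = active device name (for example ens160f0)
#   $2 = output path (defaults to network-operator-values.yaml)
write_network_operator_values() {
    device="$1"
    out="${2:-network-operator-values.yaml}"
    if [ -z "$device" ]; then
        echo "usage: write_network_operator_values DEVICE [FILE]" >&2
        return 1
    fi
    # Assumption: the nesting below mirrors the intended Helm values layout.
    cat > "$out" <<EOF
deployCR: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      devices: [$device]
EOF
}

# Usage:
#   write_network_operator_values ens160f0
```

Generating the file this way keeps the device name as the only value that changes between machines.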

For more information about the custom Network Operator values.yaml, refer to the Network Operator documentation.

Add the NVIDIA repo:

Note

Helm must be installed before you can install the Network Operator and, later, the GPU Operator.

            $ helm repo add mellanox https://mellanox.github.io/network-operator

Update the Helm repo:

            $ helm repo update
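
Before installing, it can be worth confirming that the repository was registered. The sketch below defines a hypothetical helper, has_helm_repo, that scans the output of helm repo list supplied on standard input. It assumes that output lists the repository name in the first column.

```shell
# Hypothetical helper: succeed if a repo with the given name appears in
# `helm repo list` output supplied on standard input.
#   $1 = repository name to look for (for example mellanox)
has_helm_repo() {
    # Exit status 0 when the first column of some line equals the name.
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage:
#   helm repo list | has_helm_repo mellanox && echo "mellanox repo is configured"
```

The function only inspects text, so it reports whether the name is configured, not whether the repository URL is reachable.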

Install NVIDIA Network Operator

Execute the commands below:

            $ kubectl label nodes --all node-role.kubernetes.io/master- --overwrite
            $ helm install -f ./network-operator-values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator

Validating the State of Network Operator

Installing the Network Operator can take a couple of minutes, depending on your internet speed. Check the state of the pods with the following command:

            kubectl get pods --all-namespaces | grep -E 'network-operator|nvidia-network-operator-resources'

Output:

            NAMESPACE                           NAME                                                              READY   STATUS    RESTARTS   AGE
            network-operator                    network-operator-547cb8d999-mn2h9                                 1/1     Running   0          17m
            network-operator                    network-operator-node-feature-discovery-master-596fb8b7cb-qrmvv   1/1     Running   0          17m
            network-operator                    network-operator-node-feature-discovery-worker-qt5xt              1/1     Running   0          17m
            nvidia-network-operator-resources   cni-plugins-ds-dl5vl                                              1/1     Running   0          17m
            nvidia-network-operator-resources   kube-multus-ds-w82rv                                              1/1     Running   0          17m
            nvidia-network-operator-resources   mofed-ubuntu20.04-ds-xfpzl                                        1/1     Running   0          17m
            nvidia-network-operator-resources   rdma-shared-dp-ds-2hgb6                                           1/1     Running   0          17m
            nvidia-network-operator-resources   sriov-device-plugin-ch7bz                                         1/1     Running   0          10m
            nvidia-network-operator-resources   whereabouts-56ngr                                                 1/1     Running   0          10m
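
On a busy cluster it can be easier to list only the pods that are not yet healthy. The following sketch defines a hypothetical filter, not_ready_pods, that reads pod listings in the column layout shown above (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE) and prints the pods that are neither fully ready nor Completed.

```shell
# Hypothetical helper: print pods that are not fully ready.
# Reads `kubectl get pods --all-namespaces` style output on standard input.
not_ready_pods() {
    awk '
        NF < 4            { next }   # ignore blank or malformed lines
        $1 == "NAMESPACE" { next }   # skip the header row, if present
        $4 == "Completed" { next }   # finished pods are not a problem
        {
            # READY has the form ready/total, for example 1/1
            split($3, ready, "/")
            if ($4 != "Running" || ready[1] != ready[2])
                print $1 "/" $2, $4
        }
    '
}

# Usage:
#   kubectl get pods --all-namespaces \
#     | grep -E "network-operator|nvidia-network-operator-resources" \
#     | not_ready_pods
```

No output means every listed pod is Running with all of its containers ready, which matches the state shown in the listing above.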

Please refer to the Network Operator page for more information.