NVIDIA AI Enterprise customers have access to a pre-configured GPU Operator within the NVIDIA Enterprise Catalog. The GPU Operator is pre-configured to simplify the provisioning experience with NVIDIA AI Enterprise deployments.
The pre-configured GPU Operator differs from the GPU Operator in the public NGC catalog. The differences are:
It is configured to use a prebuilt vGPU driver image (only available to NVIDIA AI Enterprise customers).
It is configured to use the NVIDIA License System (NLS).
The GPU Operator with NVIDIA AI Enterprise requires some tasks to be completed prior to installation. Refer to the NVIDIA AI Enterprise documentation for instructions before running the commands below.
Add the NVIDIA AI Enterprise Helm repository, where api-key is the NGC API key that you generated for accessing the NVIDIA Enterprise Collection:
$ helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=api-key && helm repo update
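Optionally, you can confirm that the repository was added and list the available GPU Operator chart versions before installing; this check is not required by the procedure:
$ helm repo list | grep nvaie
$ helm search repo nvaie/gpu-operator --versions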
Install the GPU Operator:
$ helm install --wait --generate-name nvaie/gpu-operator -n gpu-operator
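If you need to deploy a specific chart version rather than the latest, helm install also accepts a --version flag. After installation, you can confirm that the release deployed successfully:
$ helm list -n gpu-operator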
License GPU Operator
Copy the NLS license token to a file named client_configuration_token.tok.
Create an empty gridd.conf file:
touch gridd.conf
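As an optional sanity check, confirm that both licensing files are present in your working directory before creating the ConfigMap:
$ ls -l gridd.conf client_configuration_token.tok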
Create a ConfigMap for NLS licensing:
kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok
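You can optionally verify that the ConfigMap contains both files:
$ kubectl describe configmap licensing-config -n gpu-operator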
Create a Kubernetes secret to access the NGC registry:
kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nvaie" --docker-username='$oauthtoken' --docker-password='<YOUR API KEY>' --docker-email='<YOUR EMAIL>' -n gpu-operator
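You can optionally confirm that the secret was created (the credential data itself is stored base64-encoded and is not displayed):
$ kubectl get secret ngc-secret -n gpu-operator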
GPU Operator with RDMA
Prerequisites
Install the NVIDIA Network Operator to ensure that the MOFED drivers are installed.
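You can confirm that the MOFED driver pods deployed by the Network Operator are running before continuing; the nvidia-network-operator-resources namespace matches the one used later in this section:
$ kubectl get pods -n nvidia-network-operator-resources | grep mofed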
After the NVIDIA Network Operator installation is complete, run the command below to install the GPU Operator so that it loads the nv_peer_mem modules.
$ helm install --wait gpu-operator nvaie/gpu-operator -n gpu-operator --set driver.rdma.enabled=true
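To confirm that the release was installed with RDMA enabled, you can inspect the values supplied to the Helm release; the output should include driver.rdma.enabled set to true:
$ helm get values gpu-operator -n gpu-operator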
Note that the GPU Operator installation can take a couple of minutes, depending on your internet speed. Verify that the GPU Operator pods are running with the following command:
kubectl get pods --all-namespaces | grep -v kube-system
Results:
NAMESPACE NAME READY STATUS RESTARTS AGE
default gpu-operator-1622656274-node-feature-discovery-master-5cddq96gq 1/1 Running 0 2m39s
default gpu-operator-1622656274-node-feature-discovery-worker-wr88v 1/1 Running 0 2m39s
default gpu-operator-7db468cfdf-mdrdp 1/1 Running 0 2m39s
gpu-operator-resources gpu-feature-discovery-g425f 1/1 Running 0 2m20s
gpu-operator-resources nvidia-container-toolkit-daemonset-mcmxj 1/1 Running 0 2m20s
gpu-operator-resources nvidia-cuda-validator-s6x2p 0/1 Completed 0 48s
gpu-operator-resources nvidia-dcgm-exporter-wtxnx 1/1 Running 0 2m20s
gpu-operator-resources nvidia-dcgm-jbz94 1/1 Running 0 2m20s
gpu-operator-resources nvidia-device-plugin-daemonset-hzzdt 1/1 Running 0 2m20s
gpu-operator-resources nvidia-device-plugin-validator-9nkxq 0/1 Completed 0 17s
gpu-operator-resources nvidia-driver-daemonset-kt8g5 1/1 Running 0 2m20s
gpu-operator-resources nvidia-operator-validator-cw4j5 1/1 Running 0 2m20s
Please refer to the GPU Operator page on NGC for more information.
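Optionally, you can also confirm that GPU resources are now advertised on your worker nodes; replace <node-name> with one of your GPU nodes:
$ kubectl describe node <node-name> | grep nvidia.com/gpu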
Run the command below to list the Mellanox NICs and their status:
$ kubectl exec -it $(kubectl get pods -n nvidia-network-operator-resources | grep mofed | awk '{print $1}') -n nvidia-network-operator-resources -- ibdev2netdev
Output:
mlx5_0 port 1 ==> ens192f0 (Up)
mlx5_1 port 1 ==> ens192f1 (Down)
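Optionally, you can verify that the nv_peer_mem modules are loaded inside the driver container; this assumes a single driver pod, and the exact module name can vary between driver releases:
$ kubectl exec $(kubectl get pods -n gpu-operator-resources | grep nvidia-driver | awk '{print $1}') -n gpu-operator-resources -- lsmod | grep nv_peer_mem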
Edit the networkdefinition.yaml file:
$ nano networkdefinition.yaml
Create the network definition for IPAM and replace ens192f0 with the active Mellanox device for the master:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_device_a
  name: rdma-net-ipam
  namespace: default
spec:
  config: |-
    {
        "cniVersion": "0.3.1",
        "name": "rdma-net-ipam",
        "plugins": [
            {
                "ipam": {
                    "datastore": "kubernetes",
                    "kubernetes": {
                        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
                    },
                    "log_file": "/tmp/whereabouts.log",
                    "log_level": "debug",
                    "range": "192.168.111.0/24",
                    "type": "whereabouts"
                },
                "type": "macvlan",
                "master": "ens192f0",
                "vlan": 111
            },
            {
                "mtu": 1500,
                "type": "tuning"
            }
        ]
    }
If you do not have VLAN-based networking on the high-performance side, set "vlan": 0 in the network definition.
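After saving the file, apply the network attachment definition and confirm that it was created. The apply and verification commands are not shown elsewhere in this section, so treat them as a minimal sketch:
$ kubectl apply -f networkdefinition.yaml
$ kubectl get network-attachment-definitions -n default
A workload can then attach to this network through the standard Multus annotation and request the shared RDMA device resource; the pod name and image below are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net-ipam
spec:
  containers:
  - name: rdma-test
    image: <your-rdma-capable-image>
    resources:
      limits:
        rdma/rdma_shared_device_a: 1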