Bare Metal Deployment Guide (0.1.0)

Installing NVIDIA GPU Operator

NVIDIA AI Enterprise 2.0 or later

NVIDIA AI Enterprise customers have access to a pre-configured GPU Operator within the NVIDIA Enterprise Catalog. The GPU Operator is pre-configured to simplify the provisioning experience with NVIDIA AI Enterprise deployments.

The pre-configured GPU Operator differs from the GPU Operator in the public NGC catalog. The differences are:

  • It is configured to use a prebuilt vGPU driver image (only available to NVIDIA AI Enterprise customers).

  • It is configured to use the NVIDIA License System (NLS).

Note

Some tasks must be completed before installing the GPU Operator with NVIDIA AI Enterprise. Refer to the NVIDIA AI Enterprise documentation for instructions before running the commands below.

License GPU Operator for CLS

Add the NVIDIA AI Enterprise Helm repository, where api-key is the NGC API key that you generated for accessing the NVIDIA Enterprise Collection.

$ helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=api-key && helm repo update

  1. Copy the NLS license token into a file named client_configuration_token.tok.

  2. Create an empty gridd.conf file using the command below.

    touch gridd.conf


  3. Create a ConfigMap for NLS licensing using the command below.

    kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok


  4. Create a Kubernetes Secret to access the NGC registry.

    kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nvaie" --docker-username='$oauthtoken' --docker-password='<YOUR API KEY>' --docker-email='


  5. Install the GPU Operator with the command below.

    $ helm install --wait --generate-name nvaie/gpu-operator -n gpu-operator
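Before creating the ConfigMap in step 3, it can help to confirm that both licensing files are in place. The following is a minimal sketch and not part of the official procedure: the function name check_license_files is hypothetical, and it only assumes the two file names used in the steps above.

```shell
# Hypothetical helper (not an NVIDIA tool): check that the NLS licensing
# files from steps 1 and 2 exist in a directory (default: current directory).
# gridd.conf may be empty, but the token file must have content.
check_license_files() {
    dir="${1:-.}"
    if [ ! -f "$dir/gridd.conf" ]; then
        echo "missing: $dir/gridd.conf" >&2
        return 1
    fi
    if [ ! -s "$dir/client_configuration_token.tok" ]; then
        echo "missing or empty: $dir/client_configuration_token.tok" >&2
        return 1
    fi
    echo "licensing files present in $dir"
}
```

Running check_license_files in the directory that holds the two files before the kubectl create configmap command gives an early, readable failure instead of a ConfigMap with a missing or empty token.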


License GPU Operator for DLS

Add the NVIDIA Helm repository and update the local repository information.

$ helm repo add nvidia https://nvidia.github.io/gpu-operator && helm repo update


Prior to GPU Operator v1.9, the operator was installed in the default namespace while all operands were installed in the gpu-operator-resources namespace.

Starting with GPU Operator v1.9, both the operator and operands are installed in the same namespace. The namespace is configurable and is determined during installation. For example, to install the GPU Operator in the gpu-operator namespace:

$ helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator


If a namespace is not specified during installation, all GPU Operator components will be installed in the default namespace.
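The difference between the two cases can be made explicit with a small wrapper. This is only an illustrative sketch: gpu_operator_install_cmd is a hypothetical function that prints the helm command rather than running it, and it uses only the flags already shown above.

```shell
# Hypothetical helper: print (do not run) the helm install command for an
# optional target namespace. With no argument, no -n flag is emitted, so
# helm would install into the default namespace.
gpu_operator_install_cmd() {
    ns="${1:-}"
    if [ -n "$ns" ]; then
        echo "helm install --wait --generate-name -n $ns --create-namespace nvidia/gpu-operator"
    else
        echo "helm install --wait --generate-name nvidia/gpu-operator"
    fi
}
```

Printing the command first makes it easy to review which namespace the components will land in before anything is installed.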

GPU Operator with RDMA

Prerequisites

After the NVIDIA Network Operator installation is complete, run the command below to install the GPU Operator and load the nv_peer_mem modules.

$ helm install --wait gpu-operator nvaie/gpu-operator -n gpu-operator --set driver.rdma.enabled=true

Installing the GPU Operator can take a couple of minutes, depending on your internet speed. To check the status of the pods, run:

kubectl get pods --all-namespaces | grep -v kube-system

Results:

NAMESPACE                NAME                                                              READY   STATUS      RESTARTS   AGE
default                  gpu-operator-1622656274-node-feature-discovery-master-5cddq96gq   1/1     Running     0          2m39s
default                  gpu-operator-1622656274-node-feature-discovery-worker-wr88v       1/1     Running     0          2m39s
default                  gpu-operator-7db468cfdf-mdrdp                                     1/1     Running     0          2m39s
gpu-operator-resources   gpu-feature-discovery-g425f                                       1/1     Running     0          2m20s
gpu-operator-resources   nvidia-container-toolkit-daemonset-mcmxj                          1/1     Running     0          2m20s
gpu-operator-resources   nvidia-cuda-validator-s6x2p                                       0/1     Completed   0          48s
gpu-operator-resources   nvidia-dcgm-exporter-wtxnx                                        1/1     Running     0          2m20s
gpu-operator-resources   nvidia-dcgm-jbz94                                                 1/1     Running     0          2m20s
gpu-operator-resources   nvidia-device-plugin-daemonset-hzzdt                              1/1     Running     0          2m20s
gpu-operator-resources   nvidia-device-plugin-validator-9nkxq                              0/1     Completed   0          17s
gpu-operator-resources   nvidia-driver-daemonset-kt8g5                                     1/1     Running     0          2m20s
gpu-operator-resources   nvidia-operator-validator-cw4j5                                   1/1     Running     0          2m20s
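A listing like the one above can also be checked mechanically. The sketch below assumes the column order shown (STATUS in the fourth column, as produced with --all-namespaces); pods_not_ready is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces` output on
# stdin, skip the header line, and print "namespace/name status" for every
# pod whose STATUS column is neither Running nor Completed.
# Example use:
#   kubectl get pods --all-namespaces | grep -v kube-system | pods_not_ready
pods_not_ready() {
    awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}
```

An empty result means every listed pod is either running or has finished its validation job.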

Please refer to the GPU Operator page on NGC for more information.

Run the command below to list the Mellanox NICs and their status:

$ kubectl exec -it $(kubectl get pods -n nvidia-network-operator-resources | grep mofed | awk '{print $1}') -n nvidia-network-operator-resources -- ibdev2netdev

Output:

mlx5_0 port 1 ==> ens192f0 (Up)
mlx5_1 port 1 ==> ens192f1 (Down)
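The active device is the one reported as (Up). As a minimal sketch, assuming the ibdev2netdev output format shown above, a hypothetical helper named active_mlx_interfaces could extract the interface names like this:

```shell
# Hypothetical helper: read ibdev2netdev output on stdin and print the
# network interface name (fifth field) of every port whose state,
# the last field on the line, is (Up).
active_mlx_interfaces() {
    awk '$NF == "(Up)" { print $5 }'
}
```

Piping the output of the kubectl exec command above into active_mlx_interfaces yields the interface name to use as the master device in the network definition.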

Edit the networkdefinition.yaml file.

$ nano networkdefinition.yaml

Create the network definition for IPAM, replacing ens192f0 in the master field with the active Mellanox device.

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/rdma_shared_device_a
  name: rdma-net-ipam
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "rdma-net-ipam",
      "plugins": [
        {
          "ipam": {
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "log_file": "/tmp/whereabouts.log",
            "log_level": "debug",
            "range": "192.168.111.0/24",
            "type": "whereabouts"
          },
          "type": "macvlan",
          "master": "ens192f0",
          "vlan": 111
        },
        {
          "mtu": 1500,
          "type": "tuning"
        }
      ]
    }

Note

If you do not have VLAN-based networking on the high-performance side, set "vlan": 0.
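The master substitution can also be scripted rather than done by hand in an editor. The following is a minimal sketch, assuming the file contains a "master" entry written exactly as in the definition above; set_master_device is a hypothetical helper, not part of any NVIDIA tooling.

```shell
# Hypothetical helper: rewrite the "master" entry of a network definition file.
# Usage: set_master_device <file> <interface>
set_master_device() {
    file="$1"
    iface="$2"
    # Write to a temporary file and move it into place, because in-place
    # editing with sed -i is not portable across sed implementations.
    sed "s/\"master\": \"[^\"]*\"/\"master\": \"$iface\"/" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}
```

For example, set_master_device networkdefinition.yaml ens192f0 sets the master field to ens192f0 and leaves every other line unchanged.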

© Copyright 2022-2023, NVIDIA. Last updated on Jan 9, 2023.