Step #6: Install NVIDIA Operators

Optimize AI and Data Science Workloads (VMware Tanzu) (Latest Version)

VMware offers native TKG support for NVIDIA virtual GPUs on NVIDIA GPU Certified Servers through the NVIDIA GPU Operator and NVIDIA Network Operator. Both operators are built on the Operator Framework and provide node acceleration for the cluster. For the purposes of this lab, you will install only the NVIDIA GPU Operator. For multi-node training workloads, you should install the NVIDIA Network Operator first, followed by the NVIDIA GPU Operator.

To install the NVIDIA Operators, the kubectl context must be set to the TKG cluster namespace, not the Supervisor namespace. Set the context by running the command below.


kubectl vsphere login --server=<KUBERNETES-CONTROL-PLANE-IP-ADDRESS> --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify --tanzu-kubernetes-cluster-name tkg-cluster --tanzu-kubernetes-cluster-namespace launchpad

  1. Create the NVIDIA GPU Operator Namespace using the command below.


    kubectl create namespace gpu-operator


  2. Download the token file with the command below.


    $ ngc registry resource download-version "nvlp-aienterprise/licensetoken:1"

    Note

    The license will be inside the folder that you just downloaded.

    Note

    If needed, you can install the NGC CLI from NGC.


  3. Find the name of the downloaded token folder by using the ls command.


    $ ls


  4. Create an empty gridd.conf file using the command below.


    $ sudo touch gridd.conf


  5. Copy the CLS license token file named client_configuration_token.tok into your current working directory.

    Important

    Before you begin, you will need to generate an API key or use an existing one.

    You received an email from NVIDIA NGC when you were approved for NVIDIA LaunchPad. If you have not done so already, click the link within the email to activate the NVIDIA AI Enterprise NGC Catalog.


  6. Create a ConfigMap for CLS licensing using the command below.


    kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok
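
    For reference, the resulting ConfigMap stores each file under a key named after the file. A sketch of roughly what `kubectl get configmap licensing-config -n gpu-operator -o yaml` returns is shown below; the token contents are a placeholder, not a real token.

    ```yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: licensing-config
      namespace: gpu-operator
    data:
      gridd.conf: ""                         # empty file created in the earlier step
      client_configuration_token.tok: |
        <contents of your CLS client configuration token>
    ```

    The GPU Operator mounts these keys into the driver containers so the vGPU driver can check out a license from the CLS instance.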


  7. From a browser, go to https://ngc.nvidia.com/signin and then enter your email and password.

  8. In the top right corner, click your user account icon and select Setup.

  9. Click Get API key to open the Setup > API Key page.

    Note

    The API Key is the mechanism used to authenticate your access to the NGC container registry.


  10. Click Generate API Key to generate your API key. A warning message appears to let you know that your old API key will become invalid if you create a new key.

  11. Click Confirm to generate the key.

  12. Your API key appears.

    Important

    You only need to generate an API Key once. NGC does not save your key, so store it in a secure place. (You can copy your API Key to the clipboard by clicking the copy icon to the right of the API key.) Should you lose your API Key, you can generate a new one from the NGC website. When you generate a new API Key, the old one is invalidated.


  13. Create a Kubernetes Secret to access the NGC registry using the command below.


    kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nvaie" --docker-username='$oauthtoken' --docker-password='<YOUR API KEY>' --docker-email='<YOUR EMAIL>' -n gpu-operator


  14. Add the Helm Repo using the command below.


    helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=<YOUR API KEY>


  15. Update the Helm Repo with the command below.


    helm repo update


  16. Install NVIDIA GPU Operator using the command below.


    helm install --wait gpu-operator nvaie/gpu-operator-1-1 -n gpu-operator


VALIDATE THE NVIDIA GPU OPERATOR DEPLOYMENT

  1. Locate the NVIDIA driver daemonset using the command below.


    kubectl get pods -n gpu-operator


  2. Locate the pods whose names start with nvidia-driver-daemonset-xxxxx.

  3. Run nvidia-smi in one of the pods found using the command in Step 1.


    sysadmin@sn-vm:~$ kubectl exec -ti -n gpu-operator nvidia-driver-daemonset-sdtvt -- nvidia-smi
    Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
    Thu Jan 27 00:53:35 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  GRID T4-16C         On   | 00000000:02:00.0 Off |                    0 |
    | N/A   N/A    P8    N/A /  N/A |   2220MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+


From the output of the above command, you can see the GPU resources available to the cluster via the GPU Operator.
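
As an optional sanity check, you can also schedule a pod that requests a GPU and runs nvidia-smi. The manifest below is a minimal sketch; the pod name and container image tag are illustrative assumptions, not part of the lab.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                                   # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04   # assumed tag; any CUDA base image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                                # one vGPU, advertised by the GPU Operator's device plugin
```

Apply it with `kubectl apply -f gpu-smoke-test.yaml`, then inspect `kubectl logs gpu-smoke-test`; the familiar nvidia-smi table confirms the pod was scheduled onto a GPU-accelerated worker node.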

To further experience the power of the GPU Operator, you can work through the lab again with various time-sliced GPU resources, change the number of replicas in your tanzucluster.yaml file, and apply it to watch the worker nodes scale up and down.
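
The replica change mentioned above amounts to editing the worker count in the cluster spec. A trimmed, hypothetical fragment of a tanzucluster.yaml is sketched below; the field names follow the TanzuKubernetesCluster v1alpha1 API, while the VM class and storage class values are placeholders that will differ in your environment.

```yaml
apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: tkg-cluster
  namespace: launchpad
spec:
  topology:
    controlPlane:
      count: 1
      class: best-effort-medium      # placeholder VM class
      storageClass: vsan-default     # placeholder storage class
    workers:
      count: 3                       # change this value, then re-apply the file
      class: best-effort-large       # placeholder VM class
      storageClass: vsan-default
```

Re-apply with `kubectl apply -f tanzucluster.yaml` and watch the worker node count converge to the new value.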

You can also create and destroy clusters as needed with kubectl.

© Copyright 2022-2023, NVIDIA. Last updated on Apr 13, 2023.