Step #6: Install NVIDIA Operators
VMware offers native TKG support for NVIDIA virtual GPUs on NVIDIA GPU Certified Servers through the NVIDIA GPU Operator and the NVIDIA Network Operator. Both operators are built on the Operator Framework and provide node acceleration. For the purposes of this lab, you will install only the NVIDIA GPU Operator. For multi-node training workloads, you should install the NVIDIA Network Operator first, followed by the NVIDIA GPU Operator.
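For that multi-node case, the install order would look roughly like the sketch below; the network-operator chart name, its namespace, and the nvaie repo path are assumptions (and the Helm repo is only added later in this step), so confirm them against your NVIDIA AI Enterprise documentation before running anything.
helm install --wait network-operator nvaie/network-operator -n network-operator --create-namespace
helm install --wait gpu-operator nvaie/gpu-operator-1-1 -n gpu-operator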
The kubectl context must be set to the TKG cluster, not the Supervisor Namespace, before installing the NVIDIA Operators. This is achieved by running the command below.
kubectl vsphere login --server=<KUBERNETES-CONTROL-PLANE-IP-ADDRESS> --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify --tanzu-kubernetes-cluster-name tkg-cluster --tanzu-kubernetes-cluster-namespace launchpad
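After the login completes, you can confirm that kubectl is pointing at the TKG cluster and switch contexts if needed; the context name typically matches the cluster name, tkg-cluster, but check the list first.
kubectl config get-contexts
kubectl config use-context tkg-cluster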
Create the NVIDIA GPU Operator Namespace using the command below.
kubectl create namespace gpu-operator
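You can verify the Namespace exists before continuing.
kubectl get namespace gpu-operator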
Download the token file with the command below.
$ ngc registry resource download-version "nvlp-aienterprise/licensetoken:1"
Note: The license will be inside the folder that you just downloaded.
Note: If needed, you can install the NGC CLI from NGC.
Find the name of your token by using the list command.
$ ls
Create an empty gridd.conf file using the command below.
$ sudo touch gridd.conf
Copy the CLS license token file named client_configuration_token.tok into your current working directory.
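If the token is still inside the downloaded folder, a copy along these lines works; the folder name licensetoken_v1 is an assumption based on the resource and version downloaded above, so adjust it to match your ls output.
cp licensetoken_v1/client_configuration_token.tok .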
Important: Before you begin, you will need to generate an API key or use an existing one.
You received an email from NVIDIA NGC when you were approved for NVIDIA LaunchPad. If you have not done so already, please click the link within the email to activate the NVIDIA AI Enterprise NGC Catalog.
Create a ConfigMap for CLS licensing using the command below.
kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok
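You can confirm that the ConfigMap contains both files.
kubectl describe configmap licensing-config -n gpu-operator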
From a browser, go to https://ngc.nvidia.com/signin and then enter your email and password.
In the top right corner, click your user account icon and select Setup.
Click Get API key to open the Setup > API Key page.
Note: The API Key is the mechanism used to authenticate your access to the NGC container registry.
Click Generate API Key to generate your API key. A warning message appears to let you know that your old API key will become invalid if you create a new key.
Click Confirm to generate the key.
Your API key appears.
Important: You only need to generate an API Key once. NGC does not save your key, so store it in a secure place. (You can copy your API Key to the clipboard by clicking the copy icon to the right of the API key.) Should you lose your API Key, you can generate a new one from the NGC website. When you generate a new API Key, the old one is invalidated.
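Optionally, you can keep the key in a shell variable for the current session so you do not have to paste it into each of the following commands; NGC_API_KEY is an arbitrary name used only for illustration.
export NGC_API_KEY='<YOUR API KEY>'
You can then substitute "$NGC_API_KEY" wherever <YOUR API KEY> appears below.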
Create a Kubernetes Secret to access the NGC registry using the command below.
kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nvaie" --docker-username='$oauthtoken' --docker-password='<YOUR API KEY>' --docker-email='<YOUR EMAIL>' -n gpu-operator
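You can verify the secret was created (the key itself is stored encoded and is not displayed).
kubectl get secret ngc-secret -n gpu-operator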
Add the Helm Repo using the command below.
helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=<YOUR API KEY>
Update the Helm Repo with the command below.
helm repo update
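To confirm the repo was added and see which chart versions are available, you can search it.
helm search repo nvaie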
Install NVIDIA GPU Operator using the command below.
helm install --wait gpu-operator nvaie/gpu-operator-1-1 -n gpu-operator
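You can check the release status at any time.
helm status gpu-operator -n gpu-operator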
VALIDATE THE NVIDIA GPU OPERATOR DEPLOYMENT
Locate the NVIDIA driver daemonset using the command below.
kubectl get pods -n gpu-operator
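If you prefer filtering to scanning the full list, the driver pods can usually be selected by label; app=nvidia-driver-daemonset is an assumption based on the labels the GPU Operator commonly applies, so fall back to the full listing if it returns nothing.
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset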
Locate the pods whose names start with nvidia-driver-daemonset-xxxxx. Run nvidia-smi in one of the pods found with the command in Step 1, as shown below.
sysadmin@sn-vm:~$ kubectl exec -ti -n gpu-operator nvidia-driver-daemonset-sdtvt -- nvidia-smi
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
Thu Jan 27 00:53:35 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID T4-16C         On   | 00000000:02:00.0 Off |                    0 |
| N/A   N/A    P8    N/A /  N/A |   2220MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
From the above command, you can see the GPU resources available to the cluster via the GPU operator.
To further experience the power of the GPU Operator, you can work through the lab again with various time-sliced GPU resources, change the number of worker replicas in your tanzucluster.yaml file, and apply it to see the cluster scale up and down.
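A minimal sketch of that scaling loop, assuming your cluster manifest is named tanzucluster.yaml, that the replica count has already been edited in the file, and that you apply it from the Supervisor Namespace context (named launchpad here).
kubectl config use-context launchpad
kubectl apply -f tanzucluster.yaml
kubectl get tanzukubernetescluster -n launchpad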
You can also create and destroy clusters as needed with kubectl.