Step #6: Install NVIDIA Operators
VMware offers native TKG support for NVIDIA virtual GPUs on NVIDIA GPU Certified Servers through the NVIDIA GPU Operator and the NVIDIA Network Operator. Both operators are built on the Operator Framework and provide node acceleration. For the purposes of this lab, you will install only the NVIDIA GPU Operator. For multi-node training workloads, you should install the NVIDIA Network Operator first, followed by the NVIDIA GPU Operator.
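For that multi-node case, the install order would look roughly like the sketch below; the network-operator chart name, its namespace, and the nvaie repo path are assumptions (and the Helm repo is only added later in this step), so confirm them against your NVIDIA AI Enterprise documentation before running anything.
helm install --wait network-operator nvaie/network-operator -n network-operator --create-namespace
helm install --wait gpu-operator nvaie/gpu-operator-1-1 -n gpu-operator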
The kubectl context must be set to the TKG cluster, not the Supervisor Namespace, before installing the NVIDIA Operators. This is achieved by running the command below.
kubectl vsphere login --server=<KUBERNETES-CONTROL-PLANE-IP-ADDRESS> --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify --tanzu-kubernetes-cluster-name tkg-cluster --tanzu-kubernetes-cluster-namespace launchpad
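After the login completes, you can confirm that kubectl is pointing at the TKG cluster and switch contexts if needed; the context name typically matches the cluster name, tkg-cluster, but check the list first.
kubectl config get-contexts
kubectl config use-context tkg-cluster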
Create the NVIDIA GPU Operator Namespace using the command below.
kubectl create namespace gpu-operator
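You can verify the Namespace exists before continuing.
kubectl get namespace gpu-operator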
Download the token file with the command below.
$ ngc registry resource download-version "nvlp-aienterprise/licensetoken:1"
Note: The license will be inside the folder that you just downloaded.
Note: If needed, you can install the NGC CLI from NGC.
Find the name of your token by using the list command.
$ ls
Create an empty gridd.conf file using the command below.
$ sudo touch gridd.conf
Copy the CLS license token file named client_configuration_token.tok into your current working directory.
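If the token is still inside the downloaded folder, a copy along these lines works; the folder name licensetoken_v1 is an assumption based on the resource and version downloaded above, so adjust it to match your ls output.
cp licensetoken_v1/client_configuration_token.tok .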
Important: Before you begin, you will need to generate an API key or use an existing one.
You received an email from NVIDIA NGC when you were approved for NVIDIA LaunchPad. If you have not done so already, please click the link within the email to activate the NVIDIA AI Enterprise NGC Catalog.
Create a ConfigMap for CLS licensing using the command below.
kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok
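You can confirm that the ConfigMap contains both files.
kubectl describe configmap licensing-config -n gpu-operator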
From a browser, go to https://ngc.nvidia.com/signin and then enter your email and password.
In the top right corner, click your user account icon and select Setup.
Click Get API key to open the Setup > API Key page.
Note: The API Key is the mechanism used to authenticate your access to the NGC container registry.
Click Generate API Key to generate your API key. A warning message appears to let you know that your old API key will become invalid if you create a new key.
Click Confirm to generate the key.
Your API key appears.
Important: You only need to generate an API Key once. NGC does not save your key, so store it in a secure place. (You can copy your API Key to the clipboard by clicking the copy icon to the right of the API key.) Should you lose your API Key, you can generate a new one from the NGC website. When you generate a new API Key, the old one is invalidated.
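Optionally, you can keep the key in a shell variable for the current session so you do not have to paste it into each of the following commands; NGC_API_KEY is an arbitrary name used only for illustration.
export NGC_API_KEY='<YOUR API KEY>'
You can then substitute "$NGC_API_KEY" wherever <YOUR API KEY> appears below.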
Create a Kubernetes Secret to access the NGC registry using the command below.
kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nvaie" --docker-username='$oauthtoken' --docker-password='<YOUR API KEY>' --docker-email='<YOUR EMAIL>' -n gpu-operator
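You can verify the secret was created (the key itself is stored encoded and is not displayed).
kubectl get secret ngc-secret -n gpu-operator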
Add the Helm Repo using the command below.
helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=<YOUR API KEY>
Update the Helm Repo with the command below.
helm repo update
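To confirm the repo was added and see which chart versions are available, you can search it.
helm search repo nvaie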
Install NVIDIA GPU Operator using the command below.
helm install --wait gpu-operator nvaie/gpu-operator-1-1 -n gpu-operator
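You can check the release status at any time.
helm status gpu-operator -n gpu-operator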
VALIDATE THE NVIDIA GPU OPERATOR DEPLOYMENT
Locate the NVIDIA driver daemonset using the command below.
kubectl get pods -n gpu-operator
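If you prefer filtering to scanning the full list, the driver pods can usually be selected by label; app=nvidia-driver-daemonset is an assumption based on the labels the GPU Operator commonly applies, so fall back to the full listing if it returns nothing.
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset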
Locate the pods whose names start with nvidia-driver-daemonset-xxxxx. Run nvidia-smi in one of the pods found with the command in Step 1, as shown below.
sysadmin@sn-vm:~$ kubectl exec -ti -n gpu-operator nvidia-driver-daemonset-sdtvt -- nvidia-smi
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
Thu Jan 27 00:53:35 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID T4-16C         On   | 00000000:02:00.0 Off |                    0 |
| N/A   N/A    P8    N/A /  N/A |   2220MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
From the above command, you can see the GPU resources available to the cluster via the GPU operator.
To further experience the power of the GPU Operator, you can work through the lab again with various time-sliced GPU resources, change the number of worker replicas in your tanzucluster.yaml file, and apply it to see the cluster scale up and down.
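A minimal sketch of that scaling loop, assuming your cluster manifest is named tanzucluster.yaml, that the replica count has already been edited in the file, and that you apply it from the Supervisor Namespace context (named launchpad here).
kubectl config use-context launchpad
kubectl apply -f tanzucluster.yaml
kubectl get tanzukubernetescluster -n launchpad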
You can also create and destroy clusters as needed with kubectl.