NVIDIA AI Enterprise 1.1 or later
Before you deploy VMware vSphere with VMware Tanzu, you must ensure that the virtual infrastructure in which you are deploying it meets the following prerequisites:
VMware vSphere 7.0 U3c
NVIDIA AI Enterprise 1.1
Ubuntu 20.04 Server ISO
NVIDIA AI Enterprise 1.1 requires access to the NVIDIA AI Enterprise Catalog on NGC, administrator access to vSphere with Tanzu, and access to the management network.
You will need a minimum of two separate routable subnets, with three preferred. The first subnet is for management networking (ESXi, vCenter, the Supervisor Cluster, and the load balancer). The second subnet is for workload networking (virtual IPs and the TKG cluster) and, if only two subnets are used, front-end networking. The optional third subnet is dedicated to front-end networking. Refer to the vSphere with Tanzu Quick Start Guide for additional information.
Prior to proceeding with this guide, vSphere with Tanzu must be installed and configured. The following is an overview of the installation steps.
ESXi is installed on three hosts.
The ESXi network is configured; each host must have at least two NICs.
VMware VCSA is installed on the same network as the ESXi hosts and configured as follows:
Cluster is created with HA and DRS enabled.
Hosts are added to the cluster.
A shared storage or vSAN datastore is created.
VDS is configured.
Storage is configured.
HAProxy (load balancer) is installed and configured.
Enable Workload Management.
Namespace is created and configured.
All GPUs are configured with Shared Direct mode in vSphere.
Once vSphere with Tanzu has been successfully installed and configured, the following NVIDIA AI Enterprise software must be installed on each of the three ESXi hosts.
The NVIDIA AI Enterprise Host Software (VIB) and Guest Driver Software 1.1 are pulled from the NVIDIA AI Enterprise Catalog on NGC.
Once the server is configured with vSphere with Tanzu and NVIDIA AI Enterprise, you will need to create and deploy a TKG cluster. This document assumes that a TKG Supervisor Namespace and TKG cluster content library are already created and running.
When vSphere with Tanzu clusters are running in the data center, various tasks are executed by different personas within the enterprise. vSphere IT Administrators start the initial provisioning of the environment by creating the required components that are associated with NVIDIA vGPU devices. Once this initial provisioning is complete, the DevOps Engineer sets up and interacts with kubectl and installs the required NVIDIA AI Enterprise elements, such as the NVIDIA GPU and NVIDIA Network Operators. The following graphic illustrates the tasks executed by the vSphere IT Administrators and DevOps Engineers; highlighted in green are the steps covered within this guide.

The following sections discuss these workflows by each persona as well as the required steps in detail to provision a vSphere with Tanzu GPU accelerated cluster with NVIDIA AI Enterprise.
Step #1: Create VM Classes
Create a GPU Accelerated VM Class
A VM class is a request for resource reservations on the VM for processing power (CPU and GPU) — for example, guaranteed-large with 8 vCPUs and an NVIDIA T4 GPU. To size Tanzu Kubernetes cluster nodes, you specify a virtual machine class. vSphere with Tanzu provides default classes, or you can create your own. Use the following instructions to create a GPU accelerated VM class.
Log in to vCenter with administrator access.
From vCenter, navigate to Workload Management.
Select Services and Manage under the VM Service card.
Select VM Classes and the Create VM Class card.
Enter a name for the VM class, such as vm-class-t4-16gb.
Important: Users interacting with the VM class via the Kubernetes CLI will not be able to easily see what kind of GPU is attached to the associated node, nor the amount of GPU memory made available. Therefore, use a descriptive name that includes the GPU type and associated GPU memory.
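The naming convention recommended above can be sketched as a small helper. This is a hypothetical illustration, not part of any VMware or NVIDIA tooling: it simply encodes the GPU type and GPU memory into the class name so that DevOps Engineers working through kubectl can tell what each node provides.

```python
# Hypothetical helper illustrating the descriptive-name convention:
# embed the GPU type and GPU memory in the VM class name.
def vm_class_name(gpu_type: str, gpu_memory_gb: int) -> str:
    """Build a VM class name such as vm-class-t4-16gb."""
    return f"vm-class-{gpu_type.lower()}-{gpu_memory_gb}gb"

print(vm_class_name("T4", 16))    # vm-class-t4-16gb
print(vm_class_name("A30", 24))   # vm-class-a30-24gb
```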
Note: VM classes can be configured to use GPU partitioning. GPU partitioning is available using either NVIDIA AI Enterprise software partitioning or Multi-Instance GPU (MIG). The steps below illustrate how to create a VM class using NVIDIA AI Enterprise software partitioning. If you would like to create a VM class using MIG, follow the steps in the Create a VM Class Using MIG section.
Select Next.
Select the ADD PCI DEVICE drop down and select NVIDIA vGPU.
Select a GPU model from the drop down.
Note: All GPUs within any host attached to the Tanzu cluster will be available.
Using the information specified in the name of the VM class, populate the available options for the selected GPU type:
GPU Sharing – Time Sharing
GPU Mode – Compute
Note: There are two class reservation types: guaranteed and best effort. A guaranteed class fully reserves its configured resources; a best effort class allows resources to be overcommitted. Within a production environment, the guaranteed class type is typically used.
Click Next.
Review the information on the Review and Confirm Page and click Finish.
You have successfully created a VM class with NVIDIA AI Enterprise software partitioning.
Create a VM Class Using MIG
The following steps illustrate how to create a VM class using MIG. Not all NVIDIA GPUs support MIG; MIG support is available on a subset of NVIDIA Ampere architecture GPUs, such as the A100 and A30.
To create a VM class with MIG partitioning, you first need to configure the GPU to use MIG.
Enter a name for the VM class, such as vm-class-a30-24gb.
Select Next.
Select the ADD PCI DEVICE drop down and select NVIDIA vGPU.
Select a GPU model from the drop down.
Note: All GPUs within any host attached to the Tanzu cluster will be available.
When adding PCI Devices, click on Multi-Instance GPU Sharing from the GPU Sharing dropdown.
Using the information specified in the name of the VM class, populate the available options for the selected GPU type.
GPU Sharing – Multi-Instance GPU Sharing
GPU Mode – Compute
Select the amount of GPU Partitioned slices you would like to allocate to the VM.
Note: The values listed in the GPU Memory drop down are specific to the NVIDIA GPU. For example, the NVIDIA A30 supports the following:
4 GPU instances @ 6GB each
2 GPU instances @ 12GB each
1 GPU instance @ 24GB
If you enter a GPU Memory value that does not correspond to a valid GPU profile, the resulting VM will use the smallest profile that satisfies the request. For example, if you are using an A30 and choose 9GB, the resulting VM will have a 12GB profile.
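The rounding behavior described above can be sketched as follows. This is a hypothetical illustration of the mapping, not VMware's implementation; the profile sizes are the NVIDIA A30 instance sizes listed in the note.

```python
# Hypothetical sketch: a requested GPU Memory value is mapped to the
# smallest valid MIG profile that can satisfy it (A30 sizes shown).
A30_PROFILES_GB = [6, 12, 24]

def resolve_profile(requested_gb: int, profiles=A30_PROFILES_GB) -> int:
    """Return the smallest profile size (in GB) that satisfies the request."""
    for size in sorted(profiles):
        if requested_gb <= size:
            return size
    raise ValueError(f"{requested_gb}GB exceeds the largest available profile")

print(resolve_profile(9))   # a 9GB request resolves to the 12GB profile
```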
Click Next.
Review the information on the Review and Confirm page and click Finish.
You have successfully created a VM class with NVIDIA MIG partitioning.
Create a VM Class Using NVIDIA Networking
VM classes can also be used with NVIDIA Networking. NVIDIA networking cards can be added to VM classes as PCI devices, and VM classes can use NVIDIA networking without GPUs. In the next few steps, we will add an NVIDIA ConnectX-6 networking card after having already configured an NVIDIA vGPU.
NVIDIA networking cards can be used in SR-IOV mode. This requires additional setup and may not be needed for the average deployment. See How-to: Configure NVIDIA ConnectX-5/6 adapter in SR-IOV mode on VMware ESXi 6.7/7.0 and above for additional information.
Add a Dynamic DirectPath IO device.
In the Select Hardware drop down, select the available ConnectX-6 Dx device, or the NVIDIA networking card available in your hosts.
Click Next and review the Confirmation window.
Click Finish.
Step #2: Associate the VM Class with Supervisor Namespace
Now that you have created a VM class, next we will associate it with the Supervisor Namespace. A VM class can be added to one or more namespaces on a Supervisor Cluster, and a Supervisor Cluster can have one or more VM Classes. For most deployments, the Supervisor Namespace will have multiple VM classes to properly scale Kubernetes clusters.
This document assumes that a Supervisor Namespace and Content Library are already created and running. For demo purposes, we created a Supervisor Namespace called tkg-ns.
From vCenter, navigate to Workload Management.
Expand your Tanzu Cluster and associated namespace.
Select the Namespace and click on Manage VM Class in the VM Service card.
From the Manage VM Classes pop up, select the check box next to the VM class which you previously created. You can select one or many VM classes, dependent on how you choose to architect your deployment.
The VM class, vm-class-t4-16gb (Create a GPU Accelerated Classes) is listed below.
The VM class, vm-class-a30-24gb (Create a VM Class Using MIG) is listed below.
The VM class, nvidia-a30-24c-cx6 (Create a VM Class Using NVIDIA Networking) is listed below.
Click OK.
Validate that the content library is associated with the supervisor namespace; click on Add Content Library in the VM Service card.
From the Add Content Library pop up, select the check box for the Subscribed Content Library, which will contain the VM Template to be used by NVIDIA AI Enterprise.
The VM Template which is used by NVIDIA AI Enterprise is provided by VMware within the subscriber content library.
DevOps Engineers need access to the VM class to deploy a Tanzu Kubernetes cluster in the newly created namespace. vSphere IT Administrators must explicitly associate VM classes with any new namespaces where Tanzu Kubernetes clusters are deployed.
Step #1: Install Kubernetes CLI Tools
The DevOps Engineer will install the Kubernetes CLI Tools on a VM to interact with the Tanzu Kubernetes Grid cluster. This requires network access to the Tanzu cluster. Instructions for downloading and installing the Kubernetes CLI Tools for vSphere can be found here.
TKG Clusters and VMs are created and destroyed via the Kubernetes CLI Tools. Therefore, AI Practitioners may interact with this tool as well.
Once the Kubernetes CLI Tools have been downloaded and installed, execute the following steps to log into the master server and set the context.
The Kubernetes CLI Tools download page can be accessed from your environment via a browser and navigating to the IP address of a supervisor cluster VM.
Verify that the SHA256 checksum of vsphere*-plugin.zip matches the checksum in the provided file sha256sum.txt by running the command below in PowerShell.
Get-FileHash -Algorithm SHA256 -Path vsphere*-plugin.zip
Note: The command above is valid for the Microsoft Windows version of the Kubernetes CLI Tools.
Verify OK in the results.
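On Linux or macOS, a hypothetical equivalent of the PowerShell check uses sha256sum. A stand-in file is created here so the commands are self-contained; in practice you would hash the downloaded vsphere*-plugin.zip against the provided sha256sum.txt.

```shell
# Hypothetical Linux/macOS equivalent of the PowerShell verification step.
# The stand-in file below takes the place of the real downloaded plugin.
printf 'demo contents' > vsphere-plugin.zip       # stand-in for the download
sha256sum vsphere-plugin.zip > sha256sum.txt      # stand-in checksum file
sha256sum -c sha256sum.txt                        # prints "vsphere-plugin.zip: OK" on a match
```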
Put the contents of the .zip file in your OS’s executable search path.
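On Linux, placing the extracted plugin on the executable search path might look like the sketch below. This is a hypothetical illustration: a stub script stands in for the real kubectl-vsphere binary shipped in the .zip file, and the directory name is arbitrary.

```shell
# Hypothetical illustration of extending the executable search path.
# A stub script stands in for the real kubectl-vsphere binary.
mkdir -p "$HOME/vsphere-plugin/bin"
printf '#!/bin/sh\necho kubectl-vsphere stub\n' > "$HOME/vsphere-plugin/bin/kubectl-vsphere"
chmod +x "$HOME/vsphere-plugin/bin/kubectl-vsphere"
export PATH="$HOME/vsphere-plugin/bin:$PATH"
command -v kubectl-vsphere    # the binary now resolves from the search path
```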
Run the command below to log in to the server.
kubectl vsphere login --server=<IP_or_master_hostname>
Run the command below to view a list of your Namespaces.
kubectl config get-contexts
Run the command below to choose your default context.
kubectl config use-context <context>
Use the following commands to see your cluster’s existing nodes and pods.
kubectl get nodes
kubectl get pods -A
Step #2: Create a GPU Accelerated TKG Cluster
Within this document, we will create a YAML file that defines a GPU accelerated TKG cluster. This file contains the new TKG cluster name, the previously specified Supervisor Namespace, and the VM class.
It is recommended to allocate sufficient containerd storage for this cluster, as container images will be stored there.
List all VM class instances associated with that namespace using the command below.
kubectl get virtualmachineclasses
View GPU resources for a specific class using the command below.
kubectl describe virtualmachineclass <VMclass-name>
Create a YAML file with the appropriate configuration for your VM class.
nano tanzucluster.yaml
Populate the YAML file with the information below.
Warning: The following YAML file can only be used with vSphere 7.0 U3c.
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  name: tkg-a30-cx6
  namespace: tkg-ns
spec:
  topology:
    controlPlane:
      replicas: 3
      vmClass: guaranteed-medium
      storageClass: kubernetes-demo-storage
    nodePools:
    - name: nodepool-a30-cx6
      replicas: 2
      vmClass: nvidia-a30-24c-cx6
      storageClass: kubernetes-demo-storage
      volumes:
      - name: containerd
        mountPath: /var/lib/containerd
        capacity:
          storage: 100Gi
  distribution:
    fullVersion: 1.20.8+vmware.1-tkg.2
  settings:
    storage:
      defaultClass: kubernetes-demo-storage
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: ["198.51.100.0/12"]
      pods:
        cidrBlocks: ["192.0.2.0/16"]
      serviceDomain: local
NVIDIA AI Enterprise 3.0 or later
Example YAML for vSphere 8.0
VMware vSphere 8.0 requires an updated TKr that uses an Ubuntu OS for cluster nodes. To create a cluster for NVIDIA AI Enterprise 3.0, use the following updated example YAML file as a starting point for cluster creation.
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: tme-emo
  namespace: tkg-ns
  annotations:
    run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
spec:
  topology:
    controlPlane:
      replicas: 1
      vmClass: guaranteed-medium
      storageClass: kubernetes-demo-storage
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable
    nodePools:
    - name: nodepool-test
      replicas: 2
      vmClass: nvidia-a30-24c
      storageClass: kubernetes-demo-storage
      volumes:
      - name: containerd
        mountPath: /var/lib/containerd
        capacity:
          storage: 200Gi
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable
  settings:
    storage:
      defaultClass: kubernetes-demo-storage
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: ["198.51.100.0/12"]
      pods:
        cidrBlocks: ["192.0.2.0/16"]
      serviceDomain: managedcluster.local
Note: Additional details can be found in v1alpha3 Example: TKC with Ubuntu TKR.
Note: The name of the OS image given in the distribution - fullVersion section of the YAML must match one of the entries seen when you run kubectl get tkr at the Supervisor Cluster level (tkr stands for Tanzu Kubernetes release). That entry is placed there when you associate the Content Library with the namespace.
Apply the YAML to create the TKG cluster using the command below.
kubectl apply -f tanzucluster.yaml
Execute the command below to see the status of the cluster.
kubectl get tkc
Wait until the cluster is ready.
When the cluster is ready, the IT Administrator will be able to see the cluster created in the vCenter UI.
If you want to SSH to the Tanzu nodes, follow the instructions in SSH to Tanzu Kubernetes Cluster Nodes as the System User Using a Password.
Step #3: Install NVIDIA Operators
VMware offers native TKG support for NVIDIA virtual GPUs on NVIDIA GPU Certified Servers with the NVIDIA GPU Operator and NVIDIA Network Operator. Node acceleration achieved by these NVIDIA Operators is based on the Operator Framework. We will first install the NVIDIA Network Operator and then the NVIDIA GPU Operator to fully unlock GPUDirect RDMA capabilities.
To install the NVIDIA Operators, the kubectl context must be set to the TKG cluster namespace (not the Supervisor Namespace). This is achieved by running the command below.
kubectl vsphere login --server=<Server-IP> \
  --vsphere-username administrator@vsphere.local \
  --insecure-skip-tls-verify \
  --tanzu-kubernetes-cluster-name tkg-a30-cx6 \
  --tanzu-kubernetes-cluster-namespace tkg-ns
It is essential to install the NVIDIA Operators to ensure that the MOFED drivers are in place.
Deploy NVIDIA Network Operator
The default installation via Helm, as described below, will deploy the NVIDIA Network Operator and related CRDs. An additional step is required to create a NicClusterPolicy custom resource with the desired configuration for the cluster. Refer to the NicClusterPolicy CRD section for more information on manual custom resource creation.
The provided Helm chart contains various parameters to facilitate the creation of a NicClusterPolicy custom resource upon deployment. Refer to the NVIDIA Network Operator Helm Chart README for a full list of chart parameters.
Each NVIDIA Operator release has a set of default version values for the various components it deploys. It is recommended that these values not be changed. Testing and validation were performed with these values, and there is no guarantee of interoperability or correctness when different versions are used.
Fetch NVIDIA Network Operator Helm Chart.
helm fetch https://helm.ngc.nvidia.com/nvaie/charts/network-operator-v1.1.0.tgz --username='$oauthtoken' --password=<YOUR API KEY> --untar
Create a YAML file with the appropriate configuration.
nano values.yaml
Populate the YAML file with the information below.
deployCR: true
ofedDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
  - name: rdma_shared_device_a
    vendors: [15b3]
    devices: [ens192]
Install NVIDIA Network Operator using the command below.
helm install network-operator -f ./values.yaml -n network-operator --create-namespace --wait network-operator/
Deploy NVIDIA GPU Operator
Create NVIDIA GPU Operator Namespace.
kubectl create namespace gpu-operator
Copy the CLS license token into a file named client_configuration_token.tok.
Create an empty gridd.conf file.
sudo touch gridd.conf
Create Configmap for the CLS Licensing.
kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok
Create K8s Secret to Access NGC registry.
kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nvaie" --docker-username='$oauthtoken' --docker-password='<YOUR API KEY>' --docker-email='<YOUR EMAIL>' -n gpu-operator
Add the Helm Repo.
helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=<YOUR API KEY>
Update the Helm Repo.
helm repo update
Install NVIDIA GPU Operator.
helm install --wait gpu-operator nvaie/gpu-operator-1-1 -n gpu-operator
Validate NVIDIA GPU Operator Deployment
Locate the NVIDIA driver daemonset using the command below.
kubectl get pods -n gpu-operator
Locate the pods whose names start with nvidia-driver-daemonset-xxxxx.
Run nvidia-smi in one of the pods found above.
sysadmin@sn-vm:~$ kubectl exec -ti -n gpu-operator nvidia-driver-daemonset-sdtvt -- nvidia-smi
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
Thu Jan 27 00:53:35 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID T4-16C         On   | 00000000:02:00.0 Off |                    0 |
| N/A   N/A    P8    N/A /  N/A |   2220MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+