Installing VMware vSphere with VMware Tanzu

NVIDIA AI Enterprise 1.1 or later

Before you deploy VMware vSphere with VMware Tanzu, you must ensure that the virtual infrastructure into which you are deploying it meets the prerequisites.

Note

NVIDIA AI Enterprise 1.1 requires access to the NVIDIA AI Enterprise Catalog on NGC, as well as administrator access to vSphere with Tanzu and the management network.

Important

You will need at least two separate routable subnets, with three preferred. The first subnet is used for Management Networking (ESXi, vCenter, the Supervisor Cluster, and the load balancer). The second subnet is used for Workload Networking (virtual IPs and the TKG cluster) and, if only two subnets are available, also for front-end networking. The third subnet, when present, is dedicated to front-end networking. Refer to the vSphere with Tanzu Quick Start Guide for additional information.

Prior to proceeding with this guide, vSphere with Tanzu must be installed and configured. The following is an overview of the installation steps.

  1. ESXi is installed on 3 hosts.

    • ESXi networking is configured; each host must have at least 2 NICs configured.

  2. VMware VCSA is installed on the same network as the ESXi hosts and configured as follows:

    • Cluster is created with HA and DRS enabled.

    • Hosts are added to the cluster.

    • A datastore is created on shared storage or vSAN.

  3. VDS is configured.

  4. Storage is configured.

  5. HAProxy (load balancer) is installed and configured.

  6. Workload Management is enabled.

  7. Namespace is created and configured.

  8. All GPUs are configured with Shared Direct mode in vSphere.

Once vSphere with Tanzu has been successfully installed and configured, the following NVIDIA AI Enterprise software must be installed on each of the 3 ESXi hosts.

Note

The NVIDIA AI Enterprise Host (VIB) and Guest Driver Software 1.1 are pulled from the NVIDIA AI Enterprise Catalog on NGC.

Once the server is configured with vSphere with Tanzu and NVIDIA AI Enterprise, you will need to create and deploy a TKG cluster. This document also assumes that a TKG namespace, a TKG cluster content library, and a TKG cluster are already created and running.

When vSphere with Tanzu clusters are running in the data center, various tasks are executed by different personas within the enterprise. vSphere IT Administrators start the initial provisioning of the environment by creating the required components that are associated with NVIDIA vGPU devices. Once this initial provisioning is complete, the DevOps Engineer sets up and interacts with kubectl and installs the required NVIDIA AI Enterprise elements, such as the NVIDIA GPU and NVIDIA Network Operators. The following graphic illustrates the tasks executed by vSphere IT Administrators and DevOps Engineers; the steps covered within this guide are highlighted in green.

vmware-tanzu-02.png

The following sections discuss these workflows by each persona as well as the required steps in detail to provision a vSphere with Tanzu GPU accelerated cluster with NVIDIA AI Enterprise.

Step #1: Create VM Classes

Create a GPU Accelerated VM Class

A VM class is a request for resource reservations on the VM for processing power (CPU and GPU), for example, guaranteed-large with 8 vCPUs and an NVIDIA T4 GPU. To size Tanzu Kubernetes cluster nodes, you specify a virtual machine class. vSphere with Tanzu provides default classes, or you can create your own. Use the following instructions to create a GPU accelerated VM class.

  1. Log-in to vCenter with administrator access.

  2. From vCenter, navigate to Workload Management.

    vmware-tanzu-03.png


  3. Select Services and Manage under the VM Service card.

    vmware-tanzu-04.png


  4. Select VM Classes and the Create VM Class card.

    vmware-tanzu-05.png


  5. Enter a name for the VM class, such as vm-class-t4-16gb.

    vmware-tanzu-06.png

    Important

    Users interacting with the VM class via the Kubernetes CLI will not easily be able to see what kind of GPU is attached to the associated node, nor the GPU memory made available. Therefore, use a descriptive name that includes the GPU type and associated GPU memory.

    Note

    VM classes can be configured to use GPU partitioning. GPU partitioning is available using either NVIDIA AI Enterprise software partitioning or Multi-Instance GPU (MIG). The steps below illustrate how to create a VM class using NVIDIA AI Enterprise software partitioning. If you would like to create a VM class using MIG, follow the steps in the Create a VM Class Using MIG section.


  6. Select Next.

  7. Select the ADD PCI DEVICE drop down and select NVIDIA vGPU.

    vmware-tanzu-07.png


  8. Select a GPU model from the drop down.

    vmware-tanzu-08.png

    Note

    All GPUs within any host attached to the Tanzu cluster will be available.


  9. Using the information specified in the name of the VM class, populate the available options for the selected GPU type:

    • GPU Sharing – Time Sharing

    • GPU Mode – Compute

    vmware-tanzu-09.png

    Note

    There are two class reservation types: guaranteed and best effort. A guaranteed class fully reserves its configured resources, while a best effort class allows resources to be overcommitted. Within a production environment, the guaranteed class type is typically used.


  10. Click Next.

  11. Review the information on the Review and Confirm Page and click Finish.

    vmware-tanzu-10.png


You have successfully created a VM class that uses NVIDIA AI Enterprise software partitioning.

Create a VM Class Using MIG

The following steps illustrate how to create a VM class using MIG. Not all NVIDIA GPUs support MIG; MIG support is available on a subset of NVIDIA Ampere architecture GPUs, such as the A100 and A30.

Note

In order to create a VM class with MIG Partitioning, you first need to configure the GPU to use MIG.
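
The following is a minimal sketch of enabling MIG mode on a host GPU, assuming SSH access to the ESXi host shell, the NVIDIA AI Enterprise host driver (VIB) already installed, and GPU index 0; depending on the GPU, a GPU reset or host reboot may be required before the new mode takes effect.

    # Enable MIG mode on GPU 0 (run in the ESXi host shell)
    nvidia-smi -i 0 -mig 1

    # Confirm that MIG mode is now reported as Enabled
    nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv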

  1. Enter a name for the VM class, such as vm-class-a30-24gb.

    vmware-tanzu-11.png


  2. Select Next.

  3. Select the ADD PCI DEVICE drop down and select NVIDIA vGPU.

    vmware-tanzu-12.png


  4. Select a GPU model from the drop down.

    vmware-tanzu-13.png

    Note

    All GPUs within any host attached to the Tanzu cluster will be available.


  5. When adding PCI devices, select Multi-Instance GPU Sharing from the GPU Sharing drop down.

    vmware-tanzu-14.png


  6. Using the information specified in the name of the VM class, populate the available options for the selected GPU type.

    • GPU Sharing – Multi-Instance GPU Sharing

    • GPU Mode – Compute

    vmware-tanzu-15.png


  7. Select the number of GPU partitioned slices you would like to allocate to the VM.

    Note

    The values listed in the GPU Memory drop down are specific to the NVIDIA GPU. For example, the NVIDIA A30 supports the following GPU instances:

    • 4 GPU instances @ 6GB each

    • 2 GPU instances @ 12GB each

    • 1 GPU instance @ 24GB

    If you enter a GPU Memory value that does not match a valid GPU profile, the resulting VM will use the smallest profile that can accommodate the requested memory. For example, if you are using an A30 and choose 9GB, the resulting VM will have a 12GB profile. The profile sizes supported by your GPU can be listed on the host; see the sketch after this procedure.


  8. Click Next.

  9. Review the information on the Review and Confirm page and click Finish.

    vmware-tanzu-16.png


You have successfully created a VM class that uses NVIDIA MIG partitioning.
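
As referenced in the note for step 7 above, the MIG profile sizes supported by your specific GPU can be listed on the ESXi host. This is a sketch, assuming the NVIDIA AI Enterprise host driver is installed and MIG mode is already enabled on the GPU.

    # List the MIG-backed vGPU types offered by the host driver
    nvidia-smi vgpu -s

    # List the MIG GPU instance profiles and their memory sizes
    nvidia-smi mig -lgip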

Create a VM Class Using NVIDIA Networking

VM classes can also be used with NVIDIA Networking. NVIDIA networking cards can be added to VM classes as PCI devices, and VM classes with NVIDIA networking can be used without GPUs. In the next few steps, we will add an NVIDIA ConnectX-6 networking card to a VM class that already has an NVIDIA vGPU configured.

Note

NVIDIA networking cards can be used in SR-IOV mode. This requires additional setup and may not be needed for the average deployment. See How-to: Configure NVIDIA ConnectX-5/6 adapter in SR-IOV mode on VMware ESXi 6.7/7.0 and above for additional information.

  1. Add a Dynamic DirectPath IO device.

    vmware-tanzu-21.png


  2. In the Select Hardware drop down, select the available ConnectX-6 Dx device, or the NVIDIA networking card available in your hosts.

    vmware-tanzu-22.png


  3. Click Next and review the Confirmation window.

  4. Click Finish.

    vmware-tanzu-23.png


Step #2: Associate the VM Class with Supervisor Namespace

Now that you have created a VM class, we will associate it with the Supervisor Namespace. A VM class can be added to one or more namespaces on a Supervisor Cluster, and a Supervisor Cluster can have one or more VM classes. For most deployments, the Supervisor Namespace will have multiple VM classes to properly scale Kubernetes clusters.

Note

This document assumes that a Supervisor Namespace and Content Library are already created and running. For demo purposes, we created a Supervisor Namespace called tkg-ns.

  1. From vCenter, navigate to Workload Management.

  2. Expand your Tanzu Cluster and associated namespace.

  3. Select the Namespace and click on Manage VM Class in the VM Service card.

    vmware-tanzu-17.png


  4. From the Manage VM Classes pop up, select the check box next to the VM class which you previously created. You can select one or many VM classes, dependent on how you choose to architect your deployment.

    The VM class, vm-class-t4-16gb (Create a GPU Accelerated VM Class) is listed below.

    vmware-tanzu-18.png

    The VM class, vm-class-a30-24gb (Create a VM Class Using MIG) is listed below.

    vmware-tanzu-19.png

    The VM class, nvidia-a30-24c-cx6 (Create a VM Class Using NVIDIA Networking) is listed below.

    vmware-tanzu-24.png


  5. Click OK.

  6. Validate that the content library is associated with the supervisor namespace; click on Add Content Library in the VM Service card.

    vmware-tanzu-20.png


  7. From the Add Content Library pop up, select the check box for the Subscribed Content Library, which will contain the VM Template to be used by NVIDIA AI Enterprise.

Note

The VM Template used by NVIDIA AI Enterprise is provided by VMware within the subscribed content library.

Important

DevOps Engineers need access to the VM class to deploy a Tanzu Kubernetes cluster in the newly created namespace. vSphere IT Administrators must explicitly associate VM classes with any new namespace where a Tanzu Kubernetes cluster will be deployed.
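
As an optional check, once the DevOps Engineer has logged in with kubectl (see the next step), the VM classes associated with a namespace can be listed from the Supervisor Cluster context; the bindings are typically exposed as virtualmachineclassbindings objects. This is a sketch assuming the demo namespace tkg-ns used throughout this guide.

    # Switch to the Supervisor Namespace context
    kubectl config use-context tkg-ns

    # List the VM class bindings created by the association above
    kubectl get virtualmachineclassbindings -n tkg-ns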

Step #1: Install Kubernetes CLI Tools

The DevOps Engineer will install the Kubernetes CLI Tools on a VM to interact with the Tanzu Kubernetes Grid cluster. This requires network access to the Tanzu cluster. Instructions for downloading and installing the Kubernetes CLI Tools for vSphere can be found here.

Note

TKG Clusters and VMs are created and destroyed via the Kubernetes CLI Tools. Therefore, AI Practitioners may interact with this tool as well.

Once the Kubernetes CLI Tools have been downloaded and installed, execute the following steps to log into the master server and set the context.

Note

The Kubernetes CLI Tools download page can be accessed from your environment by browsing to the IP address of a Supervisor Cluster VM.

  1. Verify that the SHA256 checksum of vsphere*-plugin.zip matches the checksum in the provided file sha256sum.txt by running the command below in PowerShell.


    Get-FileHash -Algorithm SHA256 -Path vsphere*-plugin.zip

    Note

    The command above is valid for the Microsoft Windows version of Kubernetes CLI Tool.


  2. Verify that the computed hash matches the checksum listed in sha256sum.txt.

  3. Put the contents of the .zip file in your OS’s executable search path.

  4. Run the command below to log in to the server.


    kubectl vsphere login --server=<IP_or_master_hostname>


  5. Run the command below to view a list of your Namespaces.


    kubectl config get-contexts


  6. Run the command below to choose your default context.


    kubectl config use-context <context>


  7. Use the commands below to see your cluster’s existing nodes and pods.

    kubectl get nodes
    kubectl get pods -A


Step #2: Create a GPU Accelerated TKG Cluster

Within this document, we will create a YAML file that defines a GPU accelerated TKG cluster. This file contains the new TKG cluster name, the previously specified Supervisor Namespace, and the VM class.

Note

It is recommended to allocate enough space on the containerd volume for this cluster, as container images will be stored there.

  1. List all VM class instances associated with that namespace using the command below.


    kubectl get virtualmachineclasses


  2. View GPU resources for a specific class using the command below.


    kubectl describe virtualmachineclass <VMclass-name>


  3. Create a YAML file with the appropriate configuration for your VM class.


    nano tanzucluster.yaml


  4. Populate the YAML file with the information below.

    Warning

    The following YAML file can only be used with vSphere 7.0 U3c.

    apiVersion: run.tanzu.vmware.com/v1alpha2
    kind: TanzuKubernetesCluster
    metadata:
      name: tkg-a30-cx6
      namespace: tkg-ns
    spec:
      topology:
        controlPlane:
          replicas: 3
          vmClass: guaranteed-medium
          storageClass: kubernetes-demo-storage
        nodePools:
        - name: nodepool-a30-cx6
          replicas: 2
          vmClass: nvidia-a30-24c-cx6
          storageClass: kubernetes-demo-storage
          volumes:
          - name: containerd
            mountPath: /var/lib/containerd
            capacity:
              storage: 100Gi
      distribution:
        fullVersion: 1.20.8+vmware.1-tkg.2
      settings:
        storage:
          defaultClass: kubernetes-demo-storage
        network:
          cni:
            name: antrea
          services:
            cidrBlocks: ["198.51.100.0/12"]
          pods:
            cidrBlocks: ["192.0.2.0/16"]
          serviceDomain: local

    NVIDIA AI Enterprise 3.0 or later

    EXAMPLE YAML FOR VSPHERE 8.0

    VMware vSphere 8.0 requires an updated TKR that uses an Ubuntu OS for cluster nodes. To create a cluster for NVIDIA AI Enterprise 3.0 or later, use the updated example YAML file below as a starting point for cluster creation.

    apiVersion: run.tanzu.vmware.com/v1alpha3
    kind: TanzuKubernetesCluster
    metadata:
      name: tme-emo
      namespace: tkg-ns
      annotations:
        run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
    spec:
      topology:
        controlPlane:
          replicas: 1
          vmClass: guaranteed-medium
          storageClass: kubernetes-demo-storage
          tkr:
            reference:
              name: v1.23.8---vmware.2-tkg.2-zshippable
        nodePools:
        - name: nodepool-test
          replicas: 2
          vmClass: nvidia-a30-24c
          storageClass: kubernetes-demo-storage
          volumes:
          - name: containerd
            mountPath: /var/lib/containerd
            capacity:
              storage: 200Gi
          tkr:
            reference:
              name: v1.23.8---vmware.2-tkg.2-zshippable
      settings:
        storage:
          defaultClass: kubernetes-demo-storage
        network:
          cni:
            name: antrea
          services:
            cidrBlocks: ["198.51.100.0/12"]
          pods:
            cidrBlocks: ["192.0.2.0/16"]
          serviceDomain: managedcluster.local

    Note

    Additional details can be found here: v1alpha3 Example: TKC with Ubuntu TKR.

    Note

    The name of the OS image given in the “distribution - fullVersion” section of the YAML must match one of the entries seen when you run “kubectl get tkr” at the Supervisor Cluster level (tkr is short for Tanzu Kubernetes releases; see the sketch after this procedure). That entry is placed there when you associate the Content Library with the namespace.


  5. Apply the YAML to create the TKG cluster using the command below.


    kubectl apply -f tanzucluster.yaml


  6. Execute the command below to see the status of the cluster.


    kubectl get tkc


  7. Wait until the cluster is ready.

    vmware-tanzu-25.png


  8. When the cluster is ready, the IT Administrator will be able to see the cluster created in the vCenter UI.

    vmware-tanzu-26.png


Note

If you want to SSH to the Tanzu nodes, follow the procedure SSH to Tanzu Kubernetes Cluster Nodes as the System User Using a Password.
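
If the cluster remains in a creating state, or to confirm which release names are valid for the fullVersion field (see the note in step 4), the following commands can be run from the Supervisor Namespace context. This is a sketch assuming the cluster name and namespace from the example YAML above.

    # List the Tanzu Kubernetes releases made available by the Content Library
    kubectl get tkr

    # Inspect detailed status and events for the cluster
    kubectl describe tkc tkg-a30-cx6 -n tkg-ns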

Step #3: Install NVIDIA Operators

VMware offers native TKG support for NVIDIA virtual GPUs on NVIDIA GPU Certified Servers with the NVIDIA GPU Operator and NVIDIA Network Operator. Node acceleration achieved by these NVIDIA Operators is based on the Operator Framework. We will first install the NVIDIA Network Operator and then the NVIDIA GPU Operator to fully unlock GPUDirect RDMA capabilities.

In order to install the NVIDIA Operators, the kubectl context must be set to the TKG cluster (not the Supervisor Namespace). This is achieved by running the command below.


kubectl vsphere login --server=<Server-IP> --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify --tanzu-kubernetes-cluster-name tkg-a30-cx6 --tanzu-kubernetes-cluster-namespace tkg-ns
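
After logging in, it is worth confirming that kubectl now points at the TKG cluster rather than the Supervisor Namespace before installing the operators. This is a sketch assuming the cluster name tkg-a30-cx6 from the command above.

    # The login creates a context named after the TKG cluster
    kubectl config use-context tkg-a30-cx6

    # The control plane and worker nodes of the TKG cluster should be listed as Ready
    kubectl get nodes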

Note

It is essential to install the NVIDIA Network Operator before the GPU Operator so that the MOFED drivers are in place.

Deploy NVIDIA Network Operator

The default installation via Helm, as described below, will deploy the NVIDIA Network Operator and related CRDs. An additional step is required to create a NicClusterPolicy custom resource with the desired configuration for the cluster. Please refer to the NicClusterPolicy CRD section for more information on creating the custom resource manually.

The provided Helm chart contains various parameters to facilitate the creation of a NicClusterPolicy custom resource upon deployment. Refer to the NVIDIA Network Operator Helm Chart README for a full list of chart parameters.

Each NVIDIA Operator release has a set of default version values for the various components it deploys. It is recommended that these values not be changed. Testing and validation were performed with these values, and there is no guarantee of interoperability or correctness when different versions are used.
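
After fetching the chart in step 1 below, these defaults can be reviewed before deciding on any overrides. This is a sketch; the directory name corresponds to the --untar output of the fetch command.

    helm show values ./network-operator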

  1. Fetch NVIDIA Network Operator Helm Chart.


    helm fetch https://helm.ngc.nvidia.com/nvaie/charts/network-operator-v1.1.0.tgz --username='$oauthtoken' --password=<YOUR API KEY> --untar


  2. Create a YAML file with the appropriate configuration.


    nano values.yaml


  3. Populate the YAML file with the information below.

    deployCR: true
    ofedDriver:
      deploy: true
    rdmaSharedDevicePlugin:
      deploy: true
      resources:
        - name: rdma_shared_device_a
          vendors: [15b3]     # Mellanox/NVIDIA PCI vendor ID
          devices: [ens192]   # network device name in the worker nodes; adjust to match your environment


  4. Install NVIDIA Network Operator using the command below.


    helm install network-operator -f ./values.yaml -n network-operator --create-namespace --wait network-operator/
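
    Before moving on to the GPU Operator, it can be useful to confirm that the Network Operator pods are running and that the NicClusterPolicy created from values.yaml has been applied. This is a sketch; the resource and namespace names follow from the Helm install command above.

    kubectl get pods -n network-operator
    kubectl get nicclusterpolicy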


Deploy NVIDIA GPU Operator

  1. Create NVIDIA GPU Operator Namespace.


    kubectl create namespace gpu-operator


  2. Copy the CLS license token into a file named client_configuration_token.tok.

  3. Create an empty gridd.conf file.


    sudo touch gridd.conf


  4. Create a ConfigMap for the CLS licensing.


    kubectl create configmap licensing-config -n gpu-operator --from-file=./gridd.conf --from-file=./client_configuration_token.tok


  5. Create a Kubernetes Secret to access the NGC registry.

    kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nvaie" --docker-username='$oauthtoken' --docker-password='<YOUR API KEY>' --docker-email='<YOUR EMAIL>' -n gpu-operator


  6. Add the Helm Repo.


    helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=<YOUR API KEY>


  7. Update the Helm Repo.


    helm repo update


  8. Install NVIDIA GPU Operator.


    helm install --wait gpu-operator nvaie/gpu-operator-1-1 -n gpu-operator


Validate NVIDIA GPU Operator Deployment

  1. Locate the NVIDIA driver daemonset using the command below.


    kubectl get pods -n gpu-operator


  2. Locate the pods whose names start with nvidia-driver-daemonset-xxxxx.

  3. Run nvidia-smi within one of the pods found above.

    sysadmin@sn-vm:~$ kubectl exec -ti -n gpu-operator nvidia-driver-daemonset-sdtvt -- nvidia-smi
    Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
    Thu Jan 27 00:53:35 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  GRID T4-16C         On   | 00000000:02:00.0 Off |                    0 |
    | N/A   N/A    P8    N/A /  N/A |   2220MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

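As a final check that GPU resources can be scheduled onto the new worker nodes, a short CUDA sample can be run against the cluster. This is a sketch; the image tag below is an assumption, and any recent CUDA vectorAdd sample image from NGC can be substituted. Save the following pod spec as gpu-test.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-vectoradd
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
        resources:
          limits:
            nvidia.com/gpu: 1

Apply the pod spec and, once the pod has completed, check its log; the sample prints a PASSED message when the vGPU is usable.

    kubectl apply -f gpu-test.yaml
    kubectl logs gpu-test
    kubectl delete pod gpu-test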

© Copyright 2022-2023, NVIDIA. Last updated on Sep 11, 2023.