Optimize VMware vSphere for AI and Data Science Workloads

Step #2: Create Your First NVIDIA AI Enterprise VM

To proceed with this guide, you will create a VM with the hardware configuration in the steps below. This VM will be used for training (using TensorFlow) as well as for deploying Triton Inference Server.

Note

In a production environment, two VMs would be created: one for AI training and one to host the Triton Inference Server.

Within your AI LaunchPad journey you will create a VM from scratch that will support NVIDIA AI Enterprise. Later, the VM will be used as a gold master image.

  1. Select the AI LaunchPad host in the left pane of the vSphere Client.

  2. Right-click the LaunchPad host and select New Virtual Machine.

    dg-first-vm-02.png

  3. Select Create a new virtual machine and click Next.

    dg-first-vm-03.png

  4. Enter NLP for the virtual machine name. Next, choose the location to host the virtual machine using the Select a location for the virtual machine section. Click Next to continue.

    dg-first-vm-04.png

  5. Select a compute resource to run the VM. Click Next to continue.

    Note

    This compute resource should include an NVIDIA AI Enterprise enabled GPU which has been installed and correctly configured.

    dg-first-vm-05.png

  6. Select the datastore to host the virtual machine. Click Next to continue.

    dg-first-vm-06.png

  7. Next, select compatibility for the virtual machine. This should reflect the ESXi version for your NVIDIA-Certified Systems. Click Next to continue.

    dg-first-vm-07.png

  8. Select the appropriate Ubuntu Linux OS from the Guest OS Family and Guest OS Version pull-down menus. Click Next to continue.

    dg-first-vm-08.png

  9. On the Customize hardware page, set the virtual hardware based on the table below. Click Next to continue.

    Virtual Machine Configuration

    CPU        16 vCPU on a single socket
    RAM        64 GB
    Storage    150 GB thin provisioned disk

  10. Expand the CPU options by clicking the greater than sign. Set the CPU to 16 and the Cores per Socket to 16.

    dg-first-vm-39.png

  11. Next, set the Memory to 64 GB.

    dg-first-vm-40.png

  12. Next, expand the New Hard disk option by clicking the greater than sign. Set the storage to 150 GB and the Disk Provisioning to Thin Provision.

    dg-first-vm-41.png

  13. Review the New Virtual Machine configuration before completion. Click Finish when ready.

    dg-first-vm-10.png

  14. The new virtual machine container is created.

  15. Configure the VM boot options for EFI. Right-click on the new VM and select Edit Settings.

    dg-first-vm-11.png

  16. Click on the VM Options tab, expand Boot Options, and change the Firmware from BIOS to EFI.

    dg-first-vm-12.png

  17. Expand Advanced and select Edit Configuration.

    dg-first-vm-35.png

  18. Click the Add Configuration Params button.

    dg-first-vm-42.png

  19. Adjust the Memory Mapped I/O (MMIO) settings for the VM.

    • Add the parameter from the table below.

    Name                           Value
    pciPassthru.64bitMMIOSizeGB    128

    dg-first-vm-37.png

    • Click Add Configuration Params again and add the parameter from the table below.

    Name                           Value
    pciPassthru.use64bitMMIO       TRUE

    dg-first-vm-38.png


  20. Click OK to close the advanced configuration window, then click OK to complete the VM configuration.
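
If you have shell access to the ESXi host, one way to double-check that the MMIO parameters were saved is to search the VM's .vmx file for them. This is a minimal sketch and not part of the LaunchPad procedure: the function name is made up for illustration, the scratch file below only mimics a .vmx file, and it assumes the parameters are stored as key = "value" lines.

```shell
# List the 64-bit MMIO settings stored in a .vmx file (path passed as $1).
# Matches any key containing "64bitMMIO", ignoring case.
show_mmio_params() {
    grep -i '64bitMMIO' "$1"
}

# Demonstration on a scratch file with sample data (not a real VM).
vmx=$(mktemp)
cat > "$vmx" <<'EOF'
firmware = "efi"
pciPassthru.64bitMMIOSizeGB = "128"
EOF
show_mmio_params "$vmx"
rm -f "$vmx"
```

On a real host you would pass the path of the VM's own .vmx file on its datastore instead of the scratch file.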

Installing Ubuntu Server 20.04 LTS (Focal Fossa)

NVIDIA AI Enterprise is supported on Ubuntu 20.04 LTS operating systems. It is important to note there are two Ubuntu ISO types: Desktop and Live Server. The Desktop version includes a graphical user interface (GUI), while the Live Server version only operates via a command line. Within your LaunchPad journey you will use the Live Server version 20.04 (amd64 architecture) of Ubuntu.

  1. Right-click on the VM and select Edit Settings.

  2. Under CD/DVD drive 1, select Datastore ISO File from the drop-down menu.

    dg-first-vm-44.png

  3. Expand the datastore by clicking the greater than sign, select the ubuntu-20.04.2-live-server-amd64.iso file, and click OK.

    dg-first-vm-45.png

  4. Make sure the Connect at power on check box is selected, then click OK.

    dg-first-vm-46.png

  5. Power on the VM.

    dg-first-vm-47.png

  6. Launch Web Console and wait for the install to appear.

    dg-first-vm-48.png

  7. Select your preferred language and press the enter key.

    dg-first-vm-14.png

  8. Continue without updating as this guide is built around 20.04.

    dg-first-vm-15.png

  9. Configure the keyboard layout and press the enter key.

    dg-first-vm-16.png

  10. On this screen, select your network connection type and modify it to fit your internal requirements. This guide uses DHCP for the configuration.

    dg-first-vm-17.png

  11. In your LaunchPad Journey, you will not use a proxy address.

    dg-first-vm-18.png

  12. Use the default address and press Done.

    dg-first-vm-19.png

  13. Select Use an entire disk and uncheck Set up this disk as an LVM group if it is selected. Click Done.

    dg-first-vm-49.png

  14. Review the file system summary and select Done if satisfactory.

    dg-first-vm-21.png

  15. Select Continue on the Confirm Destructive Action screen.

    dg-first-vm-43.png

  16. Configure the VM with a user account, name, and password.

    • Username: temp

    • Password: launchpad!

    dg-first-vm-22.png


  17. Select Install OpenSSH server and select Done.

    dg-first-vm-23.png

  18. Click Done to start the OS installation. This may take several minutes to complete.

    dg-first-vm-24.png

  19. Select Reboot Now on the Ubuntu OS screen.

    dg-first-vm-25.png

  20. When the reboot is complete, return to vCenter. Right-click on the VM, select Power, and click Power Off.

    dg-first-vm-50.png

  21. Click on the VM in the Navigator window. Right-click the VM and select Edit Settings. Uncheck the Connect check box for CD/DVD drive 1.

Use the following procedure to enable vGPU support for your virtual machine. You must edit the virtual machine settings.

  1. Right-click on the VM and click Edit Settings…

    dg-first-vm-26.png

  2. Click on the Add New Device bar and select PCI device.

    dg-first-vm-28.png

  3. Select the desired GPU Profile underneath the New PCI device drop-down.

    dg-first-vm-30.png
    Note

    The NVIDIA vGPU listed within LaunchPad should be A30-24C. NVIDIA AI Enterprise requires a C-series profile.


  4. Power on the VM.

    dg-first-vm-47.png
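
Once the VM has booted, one way to confirm from inside the guest that a vGPU PCI device is attached is to look for NVIDIA's PCI vendor ID (10de) in the lspci output. The following is a minimal sketch, assuming the pciutils package (which provides lspci) is installed in the guest; the function name is hypothetical.

```shell
# Count devices carrying NVIDIA's PCI vendor ID (10de) in `lspci -nn`
# output read from stdin.
count_nvidia_devices() {
    grep -ci '\[10de:[0-9a-f]\{4\}\]'
}

# Only query the system when lspci is actually available.
if command -v lspci >/dev/null 2>&1; then
    n=$(lspci -nn | count_nvidia_devices)
    echo "NVIDIA PCI devices visible in the guest: $n"
fi
```

A count of 0 suggests the PCI device was not added to the VM, in which case the driver installation later in this guide is likely to fail.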

Note

A single VM may have multiple GPUs (PCI devices) attached; however, each GPU must then be configured with its maximum memory allocation.

GPU partitions can be a valid option for executing deep learning workloads on Ampere-based GPUs. Examples are deep learning training workflows that use smaller sentence sizes, smaller models, or smaller batch sizes. Inferencing workloads typically don’t require as much GPU memory as training workflows, and the model is generally quantized to run at a lower memory footprint (INT8 and FP16). vGPU with MIG partitioning allows a single GPU to be sliced into up to seven accelerators. These partitions can then be used by up to seven different VMs, improving GPU utilization and VM density. To turn MIG on or off on the server, refer to the Advanced GPU Configuration section of the NVIDIA AI Enterprise for VMware vSphere Deployment Guide.

Using MIG partitions for Triton Inference Server deployments within a production environment provides a better ROI for many organizations. Therefore, when you are doing your POC, the Triton VM can be assigned a fractional MIG profile such as A100-3-20C. Additional information on MIG is located here.
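
The two profile names that appear in this guide, A30-24C and A100-3-20C, can be split into their parts with plain shell parameter expansion. The sketch below is an illustrative reading of those names, assuming the pattern model, optional MIG slice count, memory size in GB, and series letter; the function name is hypothetical and this is not an official parser.

```shell
# Split a vGPU profile name such as A30-24C or A100-3-20C into its parts.
parse_vgpu_profile() {
    profile=$1
    model=${profile%%-*}          # text before the first dash
    rest=${profile#*-}            # text after the first dash
    case "$rest" in
        *-*) slices=${rest%%-*}; size=${rest#*-} ;;   # MIG-style name
        *)   slices=""; size=$rest ;;                 # no slice count
    esac
    mem_gb=${size%?}              # everything except the last character
    series=${size#"$mem_gb"}      # the last character
    echo "model=$model slices=${slices:-none} memory_gb=$mem_gb series=$series"
}

parse_vgpu_profile A30-24C      # prints: model=A30 slices=none memory_gb=24 series=C
parse_vgpu_profile A100-3-20C   # prints: model=A100 slices=3 memory_gb=20 series=C
```

Both names end in C, which matches the C-series requirement noted earlier for NVIDIA AI Enterprise.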

Now that you have created a Linux VM, you will boot the VM and install the NVIDIA AI Enterprise guest driver in the VM to fully enable GPU operation.

Downloading the NVIDIA AI Enterprise Software Driver Using NGC

Important

Before you begin you will need to generate or use an existing API key.

You received an email from NVIDIA NGC when you were approved for NVIDIA LaunchPad. If you have not done so already, click the link in that email to activate the NVIDIA AI Enterprise NGC Catalog.

  1. From a browser, go to https://ngc.nvidia.com/signin/email and then enter your email and password.

  2. In the top right corner, click your user account icon and select Setup.

  3. Click Get API key to open the Setup > API Key page.

    Note

    The API Key is the mechanism used to authenticate your access to the NGC container registry.


  4. Click Generate API Key to generate your API key. A warning message appears to let you know that your old API key will become invalid if you create a new key.

  5. Click Confirm to generate the key.

  6. Your API key appears.

    Important

    You only need to generate an API Key once. NGC does not save your key, so store it in a secure place. (You can copy your API Key to the clipboard by clicking the copy icon to the right of the API key.) Should you lose your API Key, you can generate a new one from the NGC website. When you generate a new API Key, the old one is invalidated.


  7. Now you will log into the VM using the VM Console link on the left pane of this page.

    vm-console-highlight.png

  8. Log in using the credentials previously set in Step 16 from the Installing Ubuntu Server 20.04 LTS (Focal Fossa) section.

    vm-console-lg.png

  9. Disable Nouveau using the commands below.

    $ printf 'blacklist nouveau\noptions nouveau modeset=0\n' | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
    $ sudo update-initramfs -u
    $ sudo shutdown -r now


  10. Close the VM Console window once the session has ended.

  11. Wait 60 seconds and log into the VM using the VM Console link on the left pane of this page again.

  12. Run the following commands to install the NGC CLI.

    • Install unzip:


    sudo apt-get install unzip

    • Download, unzip, and install from the command line by moving to a directory where you have execute permissions and then running the following command:


    $ wget -O ngccli_linux.zip https://ngc.nvidia.com/downloads/ngccli_linux.zip && unzip -o ngccli_linux.zip && chmod u+x ngc

    • Check the binary’s md5 hash to ensure the file wasn’t corrupted during download:


    $ md5sum -c ngc.md5

    • Add your current directory to path:


    $ echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/.bash_profile && source ~/.bash_profile

    • You must configure NGC CLI for your use so that you can run the commands. Enter the following command, including your API key when prompted:

    $ ngc config set
    Enter API key [no-apikey]. Choices: [<VALID_APIKEY>, 'no-apikey']: (COPY/PASTE API KEY)
    Enter CLI output format type [ascii]. Choices: [ascii, csv, json]: ascii
    Enter org [no-org]. Choices: ['ea-nvidia-ai-enterprise']:
    Enter team [no-team]. Choices: ['no-team']:
    Enter ace [no-ace]. Choices: ['no-ace']:

    • The following is printed to the console:


    Successfully saved NGC configuration to /home/$username/.ngc/config

    • Download the NVIDIA AI Enterprise Software Driver.


    $ ngc registry resource download-version "ea-nvidia-ai-enterprise/vgpu_guest_driver:470.63.01-ubuntu20.04"
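
Before moving on, it can help to confirm that the download produced the driver installer. The sketch below assumes the download is unpacked into the vgpu_guest_driver_v470.63.01-ubuntu20.04 directory under the current directory, which is the directory name used in the installation steps of this guide; the function name is hypothetical.

```shell
# Print the path of the first NVIDIA .run installer found directly
# inside the directory given as $1 (prints nothing if there is none).
find_driver_installer() {
    find "$1" -maxdepth 1 -type f -name 'NVIDIA-Linux-x86_64-*.run' | sort | head -n 1
}

dir=vgpu_guest_driver_v470.63.01-ubuntu20.04
if [ -d "$dir" ]; then
    installer=$(find_driver_installer "$dir")
    if [ -n "$installer" ]; then
        echo "Found installer: $installer"
    else
        echo "No .run file found in $dir"
    fi
else
    echo "Directory $dir not found; check the ngc download step"
fi
```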


Installing the NVIDIA Driver using the .run file

Installation of the NVIDIA AI Enterprise software driver for Linux requires:

  • Compiler toolchain

  • Kernel headers
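
A quick way to see whether these prerequisites are present on the VM, for example after completing the steps below, is sketched here. It assumes a Linux guest where the headers for the running kernel are reachable through /lib/modules/<kernel release>/build, and the helper name is hypothetical.

```shell
# Return 0 if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Compiler toolchain: gcc and make.
for tool in gcc make; do
    if have_cmd "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
    fi
done

# Kernel headers for the currently running kernel.
headers_dir="/lib/modules/$(uname -r)/build"
if [ -d "$headers_dir" ]; then
    echo "kernel headers: found at $headers_dir"
else
    echo "kernel headers: missing (on Ubuntu, see the linux-headers-$(uname -r) package)"
fi
```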

  1. Check for updates.


    $ sudo apt-get update


  2. Install the compiler toolchain. The command below installs the gcc compiler and the make tool.


    $ sudo apt-get install build-essential


  3. Navigate to the directory containing the NVIDIA Driver .run file. Then, add the executable permission to the NVIDIA Driver file using the chmod command.

    $ cd vgpu_guest_driver_v470.63.01-ubuntu20.04/
    $ sudo chmod +x NVIDIA-Linux-x86_64-470.63.01-grid.run


  4. From a console shell, run the driver installer as the root user, and accept defaults.


    $ sudo sh ./NVIDIA-Linux-x86_64-470.63.01-grid.run

    Note

    After the driver installer has run, the following screen may be displayed. If so, verify that you have assigned a vGPU PCIe device to the VM, then repeat the driver installation.

    gpu-install01.png

  5. After the vGPU driver has been installed, the following screen is displayed. Select OK.

    gpu-install02.png

  6. Select Yes.

    gpu-install03.png

  7. Reboot the system and log in.

  8. After the system has rebooted, confirm that you can see your NVIDIA vGPU device in the output from nvidia-smi.


    $ nvidia-smi


  9. The following nvidia-smi output verifies the installation of the driver.

    Last login: Wed Feb 9 08:27:16 2022
    temp@NLP:~$ nvidia-smi
    Wed Feb 9 08:51:30 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A30-24C      On   | 00000000:02:00.0 Off |                  N/A |
    | N/A   N/A    P0    N/A /  N/A |   2236MiB / 24571MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    temp@NLP:~$
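
The same check can be scripted with nvidia-smi's CSV query mode instead of reading the table by eye. This is a minimal sketch, assuming the vGPU name and driver version shown in this guide (NVIDIA A30-24C and 470.63.01); the function name is hypothetical, and only the first GPU reported is compared.

```shell
# Compare one "name, driver_version" CSV line read from stdin against
# the expected GPU name ($1) and driver version ($2).
check_vgpu() {
    expected_name=$1
    expected_driver=$2
    IFS=, read -r name driver || return 1
    driver=${driver# }    # drop the space that follows the comma
    [ "$name" = "$expected_name" ] && [ "$driver" = "$expected_driver" ]
}

# Only query the GPU when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi --query-gpu=name,driver_version --format=csv,noheader |
        check_vgpu "NVIDIA A30-24C" "470.63.01"; then
        echo "vGPU name and driver version match the expected values"
    else
        echo "Unexpected GPU name or driver version"
    fi
fi
```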


After installing the NVIDIA AI Enterprise guest driver, you will need to license the NVIDIA AI Enterprise Software.

To use an NVIDIA vGPU software licensed product, each client system to which a physical or virtual GPU is assigned must be able to obtain a license from the NVIDIA License System.

  1. Download the token file with the command below.


    $ ngc registry resource download-version "nvlp-aienterprise/licensetoken:1"

    Note

    The license will be inside the folder that you just downloaded.


  2. Find the name of your token by using the list command.


    $ ls


  3. Copy the token file to the /etc/nvidia/ClientConfigToken/ directory.


    $ sudo cp client_configuration_token.tok /etc/nvidia/ClientConfigToken/


  4. Ensure that the client_configuration_token.tok file has Read and Write permissions.


    $ sudo chmod +rw /etc/nvidia/ClientConfigToken/client_configuration_token.tok


  5. Copy /etc/nvidia/gridd.conf.template to /etc/nvidia/gridd.conf.


    $ sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf


  6. Set FeatureType to 4 in gridd.conf.


    $ sudo nano /etc/nvidia/gridd.conf


  7. Restart the nvidia-gridd service.


    $ sudo systemctl restart nvidia-gridd

    Note

    Please allow for 5 to 10 minutes for the license to apply after restarting nvidia-gridd service.


  8. You can confirm that the VM is licensed by running the command below.

    $ nvidia-smi -q | more

    temp@NLP:~$ nvidia-smi -q |more

    ==============NVSMI LOG==============

    Timestamp                                 : Wed Feb 9 08:53:01 2022
    Driver Version                            : 470.63.01
    CUDA Version                              : 11.4

    Attached GPUs                             : 1
    GPU 00000000:02:00.0
        Product Name                          : NVIDIA A30-24C
        Product Brand                         : NVIDIA Virtual Compute Server
        Display Mode                          : Enabled
        Display Active                        : Disabled
        Persistence Mode                      : Enabled
        MIG Mode
            Current                           : Disabled
            Pending                           : Disabled
        Accounting Mode                       : Disabled
        Accounting Mode Buffer Size           : 4000
        Driver Model
            Current                           : N/A
            Pending                           : N/A
        Serial Number                         : N/A
        GPU UUID                              : GPU-c5649d10-2334-11b2-99d7-7b62f705120a
        Minor Number                          : 0
        VBIOS Version                         : 00.00.00.00.00
        MultiGPU Board                        : No
        Board ID                              : 0x200
        GPU Part Number                       : N/A
        Module ID                             : N/A
        Inforom Version
            Image Version                     : N/A
            OEM Object                        : N/A
            ECC Object                        : N/A
            Power Management Object           : N/A
        GPU Operation Mode
            Current                           : N/A
            Pending                           : N/A
        GSP Firmware Version                  : N/A
        GPU Virtualization Mode
            Virtualization Mode               : VGPU
            Host VGPU Mode                    : N/A
        vGPU Software Licensed Product
            Product Name                      : NVIDIA Virtual Compute Server
            License Status                    : Licensed (Expiry: 2022-2-10 16:27:5 GMT)
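
Two of the licensing steps above lend themselves to scripting: setting FeatureType in gridd.conf without opening an editor, and pulling the License Status line out of the nvidia-smi -q output. The following is a minimal sketch, assuming GNU sed (as shipped with Ubuntu); both function names are hypothetical, and the demonstration runs on a scratch file with sample content so the real /etc/nvidia/gridd.conf is untouched.

```shell
# Set FeatureType to $2 in the gridd.conf-style file $1. An existing
# FeatureType line (commented out or not) is rewritten; if there is
# none, a new line is appended.
set_feature_type() {
    conf=$1
    value=$2
    if grep -Eq '^#?[[:space:]]*FeatureType=' "$conf"; then
        sed -i -E "s/^#?[[:space:]]*FeatureType=.*/FeatureType=$value/" "$conf"
    else
        printf 'FeatureType=%s\n' "$value" >> "$conf"
    fi
}

# Print the value of the "License Status" field from `nvidia-smi -q`
# output read from stdin.
license_status() {
    sed -n 's/^[[:space:]]*License Status[[:space:]]*:[[:space:]]*//p'
}

# Demonstration on a scratch file (sample content only).
tmp=$(mktemp)
printf '# sample configuration\nFeatureType=0\n' > "$tmp"
set_feature_type "$tmp" 4
grep '^FeatureType' "$tmp"    # prints: FeatureType=4
rm -f "$tmp"
```

On the VM, set_feature_type would need to run as root against /etc/nvidia/gridd.conf, and piping nvidia-smi -q into license_status prints only the license line, which is convenient while waiting for the license to apply.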


© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.