Installing and Configuring NVIDIA AI Enterprise Host Software#

This section covers installing and configuring the NVIDIA AI Enterprise Host Software:

  • Preparing the VIB File for Installation

  • Uploading the VIB in the vSphere Web Client

  • Installing NVIDIA AI Enterprise Host Software with the VIB

  • Updating the VIB

  • Verifying the Installation of the VIB

  • Uninstalling the VIB

  • Changing the Default Graphics Type in VMware vSphere

Preparing the VIB File for Installation#

Before you begin, download the archive containing the VIB file and extract its contents to a folder. The file ending in .vib is the one that you must upload to the host datastore for installation. For demonstration purposes, these steps use the VMware vSphere web interface to upload the VIB to the server host.
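As a hedged sketch (the archive and folder names below are placeholders, not the actual download names), the extraction step on a workstation might look like this:

```shell
# Hedged sketch: extract the downloaded archive into a working folder and
# locate the .vib file to upload. The archive name is a placeholder.
DEST=./nvaie-vib
mkdir -p "$DEST"
# unzip -o NVIDIA-AI-Enterprise-vSphere.zip -d "$DEST"   # use your archive's real name
find "$DEST" -iname '*.vib'
```

The find command prints the path of the .vib file that you upload in the next section.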

Uploading the VIB in the vSphere Web Client#

To upload the VIB file to the datastore using the vSphere Web Client:

  1. Select the host server and select the Datastores tab.

  2. Right-click the datastore and then select Browse Files. The Datastore Browser window displays.

  3. Click the New Folder icon. The Create a new folder window displays.

  4. Name the new folder VIB and then click OK.

  5. Select the VIB folder in the Datastore Browser window.

  6. Click the Upload Files button and navigate to the VIB file. Double-click the file to upload it. A progress bar displays below. If the operation fails, click Details and follow the instructions to bypass the certificate error manually.


The .vib file is uploaded to the datastore on the host.

Note

If you do not click Allow before the timer runs out, further attempts to upload a file will silently fail. If this happens, exit and restart vSphere Web Client. Repeat this procedure and be sure to click Allow before the timer runs out.

Installing the VIB#

The NVIDIA AI Enterprise Host Software runs on the ESXi host. It is provided in the following formats:

  • As a VIB file, which must be copied to the ESXi host and then installed

  • As an offline bundle that you can import manually as explained in Import Patches Manually

Note

To install the NVIDIA AI Enterprise Host Software (VIB), you need to access the ESXi host via the ESXi Shell or SSH. Refer to VMware’s documentation on how to Enable Access to ESXi Shell or SSH.

Note

Before proceeding with the NVIDIA AI Enterprise Host Software installation, ensure that all VMs are powered off and the ESXi host is placed in maintenance mode. Refer to VMware’s documentation on how to Place an ESXi Host in Maintenance Mode.

  1. Place the host into Maintenance mode by right-clicking it and then selecting Maintenance Mode > Enter Maintenance Mode.


    Note

    Alternatively, you can place the host into Maintenance mode using the command prompt by entering:

    esxcli system maintenanceMode set --enable=true
    

    This command will not return a response. Making this change using the command prompt will not refresh the vSphere Web Client UI. Click the Refresh icon in the upper right corner of the vSphere Web Client window.

    Important

    Placing the host into maintenance mode disables any vCenter appliance running on this host until you exit maintenance mode and restart that vCenter appliance.

  2. Click OK to confirm your selection.

  3. Use the esxcli command to install the NVIDIA AI Enterprise Host Software package:

    [root@esxi:~] esxcli software vib install -v directory/NVIDIA-AIE_ESXi_6.7.0_Driver_470.105-1OEM.670.0.0.8169922.vib
    Installation Result
       Message: Operation finished successfully.
       Reboot Required: false
       VIBs Installed: NVIDIA-AIE_ESXi_6.7.0_Driver_470.105-1OEM.670.0.0.8169922
       VIBs Removed:
       VIBs Skipped:
    

    Here, directory is the absolute path to the directory that contains the VIB file. You must specify the absolute path even if the VIB file is in the current working directory. Do not include the ds:/// prefix in the file path; instead, start the path with /vmfs/volumes/.

  4. From the vSphere Web Client, exit Maintenance Mode by right-clicking the host and selecting Exit Maintenance Mode.

    Note

    Although the output states Reboot Required: false, a reboot is necessary for the VIB to load and for xorg to start.

    Note

    Alternatively, you may exit from Maintenance mode via the command prompt by entering:

    esxcli system maintenanceMode set --enable=false
    

    This command will not return a response. Making this change via the command prompt will not refresh the vSphere Web Client UI. Click the Refresh icon in the upper right corner of the vSphere Web Client window.

  5. Reboot the host from the vSphere Web Client by right-clicking the host and then selecting Reboot.

    Note

    You can reboot the host by entering the following at the command prompt:

    reboot
    

    This command will not return a response. When you instead reboot from the vSphere Web Client, the Reboot Host window displays.

  6. When rebooting from the vSphere Web Client, enter a descriptive reason for the reboot in the Log a reason for this reboot operation field, and then click OK to proceed.
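As a concrete illustration of the absolute-path requirement in step 3, the /vmfs/volumes path can be assembled from the datastore and folder names used during upload (the datastore name below is a placeholder):

```shell
# Hedged sketch: build the absolute path that `esxcli software vib install -v`
# expects. The datastore name is a placeholder; the VIB folder is the one
# created during upload.
DATASTORE=datastore1
VIB=NVIDIA-AIE_ESXi_6.7.0_Driver_470.105-1OEM.670.0.0.8169922.vib
VIB_PATH="/vmfs/volumes/${DATASTORE}/VIB/${VIB}"
echo "$VIB_PATH"
# On the ESXi host, pass this path to the installer:
#   esxcli software vib install -v "$VIB_PATH"
```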

Updating the VIB#

Update the NVIDIA AI Enterprise Host Software package if you want to install a new version of NVIDIA AI Enterprise Host Software on a system where an existing version is already installed.

  • To update the NVIDIA AI Enterprise Host Software (VIB), you need to access the ESXi host via the ESXi Shell or SSH. Refer to VMware’s documentation on how to enable ESXi Shell or SSH for an ESXi host.

  • The driver versions shown in this document are for demonstration purposes; the versions in your local environment may differ slightly.

    Note

    Before proceeding with the NVIDIA AI Enterprise Host Software update, ensure that all VMs are powered off, and the ESXi host is placed in maintenance mode. Refer to VMware’s documentation on how to place an ESXi host in maintenance mode.

  1. Use the esxcli command to update the NVIDIA AI Enterprise Host Software package:

    [root@esxi:~] esxcli software vib update -v directory/NVIDIA-AIE_ESXi_6.7.0_Driver_470.105-1OEM.670.0.0.8169922.vib
    Installation Result
       Message: Operation finished successfully.
       Reboot Required: false
       VIBs Installed: NVIDIA-AIE_ESXi_6.7.0_Driver_470.105-1OEM.670.0.0.8169922
       VIBs Removed: NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_390.57-1OEM.600.0.0.2159203
       VIBs Skipped:
    
  2. Reboot the ESXi host and remove it from maintenance mode.

Verifying the Installation of the VIB#

After the ESXi host has rebooted, verify the installation of the NVIDIA vGPU software package. You can also view the version of the driver with the steps below.

  1. Verify that the NVIDIA vGPU software package installed and loaded correctly by checking for the NVIDIA kernel driver in the list of loaded kernel modules:

    [root@esxi:~] vmkload_mod -l | grep nvidia
    nvidia                   5    8420
    
  2. If the NVIDIA driver is not listed in the output, check dmesg for any load-time errors reported by the driver.

  3. Verify that the NVIDIA kernel driver can successfully communicate with the NVIDIA physical GPUs in your system by running the nvidia-smi command.

    Running the nvidia-smi command should produce a listing of the GPUs in your platform.

     [root@esxi:~] nvidia-smi
     Wed Jan 19 10:10:15 2022
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 470.105   Driver Version: 470.105   CUDA Version: N/A            |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |                               |                      |               MIG M. |
     |===============================+======================+======================|
     |   0  Tesla T4            On   | 00000000:1A:00.0 Off |                    0 |
     | N/A   38C    P8    17W /  70W |     83MiB / 15359MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   1  Tesla T4            On   | 00000000:3B:00.0 Off |                    0 |
     | N/A   37C    P8    16W /  70W |     75MiB / 15359MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   2  Tesla T4            On   | 00000000:87:00.0 Off |                    0 |
     | N/A   34C    P8    16W /  70W |     75MiB / 15359MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   3  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
     | N/A   38C    P8    16W /  70W |     75MiB / 15359MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+
     |   4  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
     | N/A   36C    P8    16W /  70W |     75MiB / 15359MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+

     +-----------------------------------------------------------------------------+
     | Processes:                                                                  |
     |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
     |        ID   ID                                                   Usage      |
     |=============================================================================|
     |  No running processes found                                                 |
     +-----------------------------------------------------------------------------+
    

If nvidia-smi fails to report the expected output for all the NVIDIA GPUs in your system, see the NVIDIA AI Enterprise User Guide for troubleshooting steps.

The NVIDIA System Management Interface, nvidia-smi, also allows GPU monitoring with the following command:

nvidia-smi -l

This switch adds a loop that automatically refreshes the display; by default nvidia-smi refreshes every 5 seconds, and you can supply a different interval in seconds, for example nvidia-smi -l 10.
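To check the driver version in a script rather than by eye, the version string can be extracted from the first header line of the nvidia-smi output. The pipeline below is a hedged sketch, simulated here with the header line from the listing above; on the host you would pipe nvidia-smi itself into the same sed command:

```shell
# Hedged sketch: pull the driver version out of nvidia-smi's header line.
# Simulated input; on the host, pipe the real `nvidia-smi` output instead.
driver=$(printf '%s\n' \
  '| NVIDIA-SMI 470.105   Driver Version: 470.105   CUDA Version: N/A |' \
  | sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p')
echo "$driver"   # prints 470.105
```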

Uninstalling the VIB#

To uninstall NVIDIA AI Enterprise Host Software:

  1. Run esxcli to determine the name of the vGPU driver bundle.

    esxcli software vib list | grep -i nvidia
    NVIDIA-AIE_ESXi_7.0.2_Driver_470.63-1OEM.702.0.0.17630552   NVIDIA   VMwareAccepted   2022-01-19
    
  2. Run the following command to uninstall the driver package:

    esxcli software vib remove -n NVIDIA-AIE_ESXi_7.0.2_Driver_470.63-1OEM.702.0.0.17630552 --maintenance-mode
    

The following message displays if the uninstall process is successful:

Removal Result
    Message: Operation finished successfully.
    Reboot Required: false
    VIBs Installed:
    VIBs Removed: NVIDIA-AIE_ESXi_7.0.2_Driver_470.63-1OEM.702.0.0.17630552
    VIBs Skipped:

Reboot the host to complete the uninstall of the NVIDIA AI Enterprise Host Software.
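Because the bundle name is long and easy to mistype, it can also be captured from the list output rather than retyped. The pipeline below is a hedged sketch, simulated with the sample output above; on the host, pipe the real esxcli command into awk instead:

```shell
# Hedged sketch: capture the NVIDIA bundle name from the output of
# `esxcli software vib list | grep -i nvidia`. Simulated input here.
vib=$(printf '%s\n' \
  'NVIDIA-AIE_ESXi_7.0.2_Driver_470.63-1OEM.702.0.0.17630552  NVIDIA  VMwareAccepted  2022-01-19' \
  | awk '{print $1}')
echo "$vib"
# On the ESXi host:
#   esxcli software vib remove -n "$vib" --maintenance-mode
```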

Changing the Default Graphics Type in VMware vSphere#

The NVIDIA AI Enterprise Host Software (VIB) for VMware vSphere provides Virtual Shared Graphics Acceleration (vSGA) and vGPU functionality in a single VIB. After this VIB is installed, the default graphics type is Shared, which provides vSGA functionality. To enable vGPU support for VMs in VMware vSphere, you must change the default graphics type to Shared Direct. If you do not modify the default graphics type, VMs to which a vGPU is assigned fail to start, and the following error message is displayed:

The amount of graphics resources available in the parent resource pool is insufficient for the operation.

Change the default graphics type before configuring vGPU. Output from the VM console in the VMware vSphere Web Client is not available for VMs that are running vGPU. Before changing the default graphics type, ensure that the ESXi host is running and that all VMs on the host are powered off.

  1. Log in to vCenter Server by using the vSphere Web Client.

  2. In the navigation tree, select your ESXi host and click the Configure tab.

  3. From the menu, choose Graphics and then click the Host Graphics tab.

  4. On the Host Graphics tab, click Edit.

  5. In the Edit Host Graphics Settings dialog box that opens, select Shared Direct and click OK.


    Note

    This dialog box also lets you change the allocation scheme for vGPU-enabled VMs. For more information, see Modifying GPU Allocation Policy on VMware vSphere.

  6. After you click OK, the default graphics type changes to Shared Direct.

  7. Either restart the ESXi host, or stop and restart the Xorg service and nv-hostengine on the ESXi host. To stop and restart the Xorg service and nv-hostengine, perform these steps:

    • Stop the Xorg service.

      [root@esxi:~] /etc/init.d/xorg stop
      
    • Stop nv-hostengine.

      [root@esxi:~] nv-hostengine -t
      
    • Wait for 1 second to allow nv-hostengine to stop.

    • Start nv-hostengine.

      [root@esxi:~] nv-hostengine -d
      
    • Start the Xorg service.

      [root@esxi:~] /etc/init.d/xorg start
      

After changing the default graphics type, configure vGPU as needed in Configuring a vSphere VM with Virtual GPU.
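On ESXi 6.5 and later you can also confirm the host graphics setting from the shell with esxcli graphics host get; Shared Direct is reported there as SharedPassthru. The check below is a hedged sketch that parses simulated output from that command; on the host, pipe the real command into the same sed expression:

```shell
# Hedged sketch: parse (simulated) `esxcli graphics host get` output to
# confirm the default graphics type is Shared Direct ("SharedPassthru").
gtype=$(printf '%s\n' \
  'Default Graphics Type: SharedPassthru' \
  'Shared Passthru Assignment Policy: Performance' \
  | sed -n 's/.*Default Graphics Type: //p')
echo "$gtype"   # prints SharedPassthru
```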

See also the following topics in VMware vSphere documentation: