1. Introduction to the NVIDIA Virtual GPU Management Pack for VMware vRealize Operations

NVIDIA® Virtual GPU Management Pack for VMware vRealize Operations enables you to use a VMware vRealize Operations cluster to monitor the performance of NVIDIA physical GPUs and virtual GPUs.

VMware vRealize Operations provides integrated performance, capacity, and configuration management capabilities for VMware vSphere, physical, and hybrid cloud environments. It provides a management platform that can be extended by adding third-party management packs. For more information, see the VMware vRealize Operations Documentation.

NVIDIA Virtual GPU Management Pack for VMware vRealize Operations collects metrics and analytics for NVIDIA virtual GPU software from virtual GPU manager instances. It then sends these metrics to the metrics collector in a VMware vRealize Operations cluster, where they are displayed in custom NVIDIA dashboards.



Diagram showing how NVIDIA Virtual GPU Management Pack for VMware vRealize Operations collects metrics and analytics for NVIDIA virtual GPU software and displays them in custom NVIDIA dashboards.

2. Installing and Configuring the NVIDIA Virtual GPU Management Pack for VMware vRealize Operations

The NVIDIA Virtual GPU Management Pack for VMware vRealize Operations is distributed as a PAK (.pak) file. After installing the NVIDIA Virtual GPU Management Pack for VMware vRealize Operations, you must configure it by creating an NVIDIA vGPU adapter instance and, if you haven't already done so, by creating a VMware vCenter adapter instance.

2.1. Installation and Configuration Prerequisites

Before installing and configuring the NVIDIA Virtual GPU Management Pack for VMware vRealize Operations, ensure that the following prerequisites are met:

  • vRealize Operations Manager v6.6 or later is installed.
  • An NVIDIA virtual GPU software version 5.0 or later driver package is configured on the hosts in your VMware vSphere ESXi cluster.
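
The two version prerequisites above can be checked with a simple comparison. The following is a minimal sketch, not part of the management pack: the helper names are hypothetical, and it assumes that the installed versions are available as purely numeric dotted strings such as 6.6.1.

```python
# Minimum versions taken from the prerequisites listed above.
MIN_VROPS_VERSION = "6.6"
MIN_VGPU_SOFTWARE_VERSION = "5.0"


def parse_version(text):
    """Convert a dotted numeric version such as '6.6.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    installed_parts = parse_version(installed)
    minimum_parts = parse_version(minimum)
    # Pad the shorter tuple with zeros so that '6.6' and '6.6.0' compare equal.
    width = max(len(installed_parts), len(minimum_parts))
    installed_parts += (0,) * (width - len(installed_parts))
    minimum_parts += (0,) * (width - len(minimum_parts))
    return installed_parts >= minimum_parts


def prerequisites_met(vrops_version, vgpu_software_version):
    """Check both documented prerequisites (hypothetical helper)."""
    return (meets_minimum(vrops_version, MIN_VROPS_VERSION)
            and meets_minimum(vgpu_software_version, MIN_VGPU_SOFTWARE_VERSION))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating 6.10 as earlier than 6.6.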

2.2. Installing or Updating the Management Pack

If you have previously installed the NVIDIA Virtual GPU Management Pack for VMware vRealize Operations, back up any customized dashboards before updating the management pack. The update will overwrite any NVIDIA dashboard of the same name.

  1. Download the NVIDIA Virtual GPU Management Pack for VMware vRealize Operations PAK (.pak) file. Ensure that the downloaded file is accessible to the web browser that you are using to manage your vRealize Operations Manager instance.
  2. Log in to your vRealize Operations Manager instance as an administrator user.
  3. On the vRealize Operations Manager Home page, follow the Administration link.
  4. Click Solutions and click the plus sign in the toolbar.
  5. Click Browse and navigate to your copy of the PAK file.
  6. If you have previously installed the NVIDIA Virtual GPU Management Pack for VMware vRealize Operations, select these options:
    • Install the PAK file even if it is already installed
    • Reset Default Content
  7. Select the PAK file and click Upload.
  8. Accept the EULA for the NVIDIA Virtual GPU Management Pack for VMware vRealize Operations and click Next.
    Note: Uploading and installing the PAK file may take several minutes. Status information appears in the Installation Details text box throughout the installation process.
  9. When the installation is complete, click Finish. This last page displays progress details for the installation.

2.3. Creating an NVIDIA vGPU Adapter Instance

After installing the NVIDIA Virtual GPU Management Pack for VMware vRealize Operations, you must configure it by creating an NVIDIA vGPU adapter instance.

Note: If you haven't already done so, you must also create a VMware vCenter adapter instance.

An NVIDIA vGPU adapter instance connects to a VMware vCenter Server instance and retrieves data from vGPU-enabled hosts in the server instance. You must provide the host name of the VMware vCenter Server instance that the adapter instance will connect to and credentials to be used for connecting to the server instance.

  1. If you are not already logged in, log in to your vRealize Operations Manager instance as an administrator user.
  2. On the vRealize Operations Manager Home page, follow the Administration link.
  3. Click Solutions, select NVIDIA Virtual GPU Management Pack for VMware vRealize Operations, and click Configure on the toolbar.

    The Manage Solution page opens.



    Screen capture showing the Manage Solution page for creating an NVIDIA vGPU adapter instance.

  4. From the Adapter Type list at the top of the page, select NVIDIA vGPU Adapter.
  5. Click the plus sign.
  6. Provide the following information about the adapter instance that you are creating:

    Display Name
      Enter the name of the instance as you want it to appear in vRealize Operations Manager.

    Description
      Enter a description that can help distinguish this instance when multiple NVIDIA vGPU adapter instances are configured.

    vCenter Server
      Enter the IP address of the VMware vCenter Server.

    Credential
      Click the plus sign and in the Manage Credential dialog box that opens, add the credentials for the user that will connect to this vCenter Server instance.

      Screen capture showing the Manage Credential dialog box.

      Credential name
        Enter the display name of the user.

      Username
        Enter the user login name.

      Password
        Enter the password of the user.
  7. Click Save Settings.

After installing and configuring NVIDIA Virtual GPU Management Pack for VMware vRealize Operations, verify the installation and configuration as explained in Viewing Data on NVIDIA Dashboards.
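
Before opening the Manage Solution page, it can help to gather the values that the adapter instance requires. The following sketch models those values as a plain data structure and reports any that are still empty. It is illustrative only: the class and function names are hypothetical, it does not call any vRealize Operations API, and treating Description as optional is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Credential:
    """Values entered in the Manage Credential dialog box."""
    credential_name: str
    username: str
    password: str


@dataclass
class AdapterInstanceSettings:
    """Values entered when creating an NVIDIA vGPU adapter instance."""
    display_name: str
    description: str
    vcenter_server: str
    credential: Credential


def missing_fields(settings):
    """Return the labels of fields that are empty or contain only whitespace."""
    values = {
        "Display Name": settings.display_name,
        "vCenter Server": settings.vcenter_server,
        "Credential name": settings.credential.credential_name,
        "Username": settings.credential.username,
        "Password": settings.credential.password,
    }
    # Description is deliberately not checked (assumed optional in this sketch).
    return [label for label, value in values.items() if not value.strip()]
```

For example, settings with an empty vCenter Server value and a blank password yield the list ["vCenter Server", "Password"], which shows what remains to be collected before saving the adapter instance.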

3. Managing Metrics and Analytics for NVIDIA Virtual GPU Software in VMware vRealize Operations

Managing metrics and analytics for NVIDIA virtual GPU software in VMware vRealize Operations involves viewing data on NVIDIA dashboards and changing the settings of the NVIDIA vGPU adapter and NVIDIA vGPU alert definitions.

3.1. Viewing Data on NVIDIA Dashboards

After installing and configuring NVIDIA Virtual GPU Management Pack for VMware vRealize Operations, you can view the data on NVIDIA dashboards to verify the installation and configuration. If you have just completed the installation and configuration, allow the adapter to work for ten to fifteen minutes to collect data to display on the dashboards.

  1. On the vRealize Operations Manager Home page, click Dashboards in the menu bar.
  2. In the All Dashboards drop-down list, select the NVIDIA Dashboards group.

    This group contains the following dashboards:

    • NVIDIA Environment Overview
    • NVIDIA Host Summary
    • NVIDIA GPU Summary
    • NVIDIA vGPU Summary
    • NVIDIA Application Summary

3.2. Changing the NVIDIA vGPU Adapter Collection Interval

If you need to change how frequently the NVIDIA vGPU adapter collects metrics, change the collection interval. The default collection interval is five minutes.

  1. If you are not already logged in, log in to your vRealize Operations Manager instance as an administrator user.
  2. On the vRealize Operations Manager Home page, follow the Administration link.
  3. In the left pane, click Configuration.
  4. Click Inventory Explorer and expand Adapter Instances in the center pane.
  5. Expand NVIDIA vGPU Adapter Instance and select the adapter name.
  6. In the right pane, on the List tab, select the adapter name and click Edit Object.
  7. On Advanced Settings, enter the new collection interval in the Collection Interval (Minutes) field.
    Note: The minimum value that you can set is 1 minute.
  8. Click OK.
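
The constraints on the collection interval can be summarized in a few lines of code. The following sketch uses hypothetical names and assumes that the interval is entered as a whole number of minutes. It rejects values below the documented minimum of 1 minute and shows how the interval affects the number of collection cycles completed per hour.

```python
# Values stated in this section.
DEFAULT_COLLECTION_INTERVAL_MINUTES = 5
MINIMUM_COLLECTION_INTERVAL_MINUTES = 1


def validate_collection_interval(minutes):
    """Return the interval unchanged if it is acceptable, otherwise raise."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(minutes, bool) or not isinstance(minutes, int):
        raise TypeError("collection interval must be a whole number of minutes")
    if minutes < MINIMUM_COLLECTION_INTERVAL_MINUTES:
        raise ValueError("collection interval must be at least 1 minute")
    return minutes


def collections_per_hour(minutes=DEFAULT_COLLECTION_INTERVAL_MINUTES):
    """Return how many full collection cycles complete in 60 minutes."""
    return 60 // validate_collection_interval(minutes)
```

With the default interval, collections_per_hour() returns 12. At the minimum interval of 1 minute it returns 60, so a shorter interval produces proportionally more collection cycles.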

3.3. Changing the Threshold of a Symptom in an Alert Definition

An alert definition is a combination of symptoms that identify a problem area and generate alerts for that area. Each symptom in an alert is associated with a metric. For each symptom, a threshold value is defined for its associated metric. If the threshold value is reached, an alert is generated.

For detailed information about the alerts defined for NVIDIA vGPU metrics, including the default threshold values of symptoms in these alerts, see NVIDIA vGPU Alert Definitions.

  1. In the menu bar of the vRealize Operations Manager Home page, click Alerts.
  2. In the left pane, click Alert Settings.
  3. Click Symptom Definitions.
  4. Click All Filters, then click Object Type, and type GPU or vGPU. The symptom definitions for the object type that you selected are listed.
  5. Select the symptom definition that you want to change and click the Edit icon.
  6. Change the threshold to the new value that you want and click Save.

    Screen capture showing the window for changing the definition of the GPU Memory Utilization is moderately high symptom for the GPU: Utilization|Memory Utilization metric.

A. NVIDIA vGPU Alert Definitions

The management pack provides alert definitions for the NVIDIA vGPU metrics and analytics that it integrates with VMware vRealize Operations. Each alert definition is a combination of symptoms that identify a problem area and generate alerts for that area.

Alerts defined for GPU utilization can be generated by any of the GPU engines, namely:

  • 3D/Compute
  • Memory controller
  • Video encoder
  • Video decoder

A.1. GPU Utilization Is High

This alert is generated when the utilization of any of the GPU engines is high.

Symptom Name                                   Associated Metric                        Criticality  Threshold (%)
GPU 3D/Compute Utilization is critically high  GPU: Utilization|3D/Compute Utilization  Immediate    90
GPU 3D/Compute Utilization is moderately high  GPU: Utilization|3D/Compute Utilization  Warning      75
GPU Memory Utilization is critically high      GPU: Utilization|Memory Utilization      Immediate    90
GPU Memory Utilization is moderately high      GPU: Utilization|Memory Utilization      Warning      75
GPU Encoder Utilization is critically high     GPU: Utilization|Encoder Utilization     Immediate    90
GPU Encoder Utilization is moderately high     GPU: Utilization|Encoder Utilization     Warning      75
GPU Decoder Utilization is critically high     GPU: Utilization|Decoder Utilization     Immediate    90
GPU Decoder Utilization is moderately high     GPU: Utilization|Decoder Utilization     Warning      75
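
The default thresholds in the table above can be expressed as a small classification function. This is a sketch with hypothetical names, not the logic used by vRealize Operations. It assumes that a symptom triggers when the metric is greater than or equal to its threshold, and it reports only the most severe criticality for each engine.

```python
# Default thresholds from the table above.
WARNING_THRESHOLD = 75
IMMEDIATE_THRESHOLD = 90


def classify_utilization(value):
    """Return the criticality for one utilization value, or None if below both thresholds."""
    if value >= IMMEDIATE_THRESHOLD:
        return "Immediate"
    if value >= WARNING_THRESHOLD:
        return "Warning"
    return None


def triggered_symptoms(engine_utilization):
    """Map each engine name to its criticality, omitting engines below both thresholds."""
    result = {}
    for engine, value in engine_utilization.items():
        criticality = classify_utilization(value)
        if criticality is not None:
            result[engine] = criticality
    return result
```

For example, utilization values of 95 for 3D/Compute, 80 for Memory, and 10 for Encoder produce an Immediate result for 3D/Compute, a Warning result for Memory, and no result for Encoder.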

A.2. vGPU Utilization Is High

This alert is generated when the utilization of any of the GPU engines is high on any virtual GPU.

Symptom Name                                    Associated Metric                         Criticality  Threshold (%)
vGPU 3D/Compute Utilization is critically high  vGPU: Utilization|3D/Compute Utilization  Immediate    90
vGPU 3D/Compute Utilization is moderately high  vGPU: Utilization|3D/Compute Utilization  Warning      75
vGPU Memory Utilization is critically high      vGPU: Utilization|Memory Utilization      Immediate    90
vGPU Memory Utilization is moderately high      vGPU: Utilization|Memory Utilization      Warning      75
vGPU Encoder Utilization is critically high     vGPU: Utilization|Encoder Utilization     Immediate    90
vGPU Encoder Utilization is moderately high     vGPU: Utilization|Encoder Utilization     Warning      75
vGPU Decoder Utilization is critically high     vGPU: Utilization|Decoder Utilization     Immediate    90
vGPU Decoder Utilization is moderately high     vGPU: Utilization|Decoder Utilization     Warning      75

A.3. vGPU Utilization Is High for Process

This alert is generated when the utilization of any of the GPU engines is high for any process on any virtual GPU.

Symptom Name                                                Associated Metric                Criticality  Threshold (%)
vGPU 3D/Compute Utilization is critically high for Process  Process: 3D/Compute Utilization  Immediate    90
vGPU 3D/Compute Utilization is moderately high for Process  Process: 3D/Compute Utilization  Warning      75
vGPU Memory Utilization is critically high for Process      Process: Memory Utilization      Immediate    90
vGPU Memory Utilization is moderately high for Process      Process: Memory Utilization      Warning      75
vGPU Encoder Utilization is critically high for Process     Process: Encoder Utilization     Immediate    90
vGPU Encoder Utilization is moderately high for Process     Process: Encoder Utilization     Warning      75
vGPU Decoder Utilization is critically high for Process     Process: Decoder Utilization     Immediate    90
vGPU Decoder Utilization is moderately high for Process     Process: Decoder Utilization     Warning      75

A.4. GPU Temperature Is High

This alert is generated when the GPU temperature is high enough to force slowdown or shutdown.

Symptom Name                         Associated Metric                     Criticality  Threshold
GPU Temperature is forcing slowdown  GPU: Temperature|Current Temperature  Critical     Slowdown Temperature
GPU Temperature is forcing shutdown  GPU: Temperature|Current Temperature  Immediate    Shutdown Temperature minus 5
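
Unlike the utilization thresholds, the temperature thresholds are derived from per-GPU values. The following sketch, with hypothetical names, shows how the two symptoms above relate to a GPU's slowdown and shutdown temperatures. It assumes that all three temperatures are in the same unit and that a symptom triggers when the current temperature is greater than or equal to its threshold.

```python
# The shutdown symptom triggers 5 degrees below the shutdown temperature.
SHUTDOWN_MARGIN = 5


def temperature_symptoms(current, slowdown_temperature, shutdown_temperature):
    """Return (symptom name, criticality) pairs triggered by the current temperature."""
    symptoms = []
    if current >= slowdown_temperature:
        symptoms.append(("GPU Temperature is forcing slowdown", "Critical"))
    if current >= shutdown_temperature - SHUTDOWN_MARGIN:
        symptoms.append(("GPU Temperature is forcing shutdown", "Immediate"))
    return symptoms
```

Both symptoms are evaluated independently, so a GPU whose slowdown temperature lies within 5 degrees of its shutdown temperature can trigger the shutdown symptom before the slowdown symptom.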

Notices

Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

HDMI

HDMI, the HDMI logo, and High-Definition Multimedia Interface are trademarks or registered trademarks of HDMI Licensing LLC.

OpenCL

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

Trademarks

NVIDIA, the NVIDIA logo, NVIDIA GRID, vGPU, Pascal, Quadro, and Tesla are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.