Health Monitor

group DCGMAPI_HM

This chapter describes the functions that configure and query the GPU health monitor.

Functions

dcgmReturn_t dcgmHealthSet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t systems)

Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t.

Parameters
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more entities. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS to operate on all GPUs or DCGM_GROUP_ALL_NVSWITCHES to operate on all NvSwitches.

  • systems – IN: The systems to enable for health checks, given as dcgmHealthSystems_t values combined with a bitwise OR. Refer to dcgmHealthSystems_t for details.
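
As a minimal sketch of how a systems argument can be assembled, the fragment below ORs individual watch flags into one mask. The enum is a local stand-in so the fragment compiles without the DCGM headers: the DEMO_WATCH_* names, their values, and demo_build_systems are assumptions made for illustration. Real code would combine the constants defined by dcgmHealthSystems_t and pass the result to dcgmHealthSet.

```c
#include <stdint.h>

/* Local stand-in for dcgmHealthSystems_t. Names and values are
 * illustrative assumptions, not the library's definitions. */
typedef enum {
    DEMO_WATCH_PCIE    = 0x1,
    DEMO_WATCH_MEM     = 0x10,
    DEMO_WATCH_THERMAL = 0x80
} demo_health_systems_t;

/* Combine the requested watches into the single mask that a call
 * shaped like dcgmHealthSet(handle, groupId, systems) takes. */
demo_health_systems_t demo_build_systems(int pcie, int mem, int thermal)
{
    uint32_t mask = 0;

    if (pcie) {
        mask |= DEMO_WATCH_PCIE;
    }
    if (mem) {
        mask |= DEMO_WATCH_MEM;
    }
    if (thermal) {
        mask |= DEMO_WATCH_THERMAL;
    }
    return (demo_health_systems_t)mask;
}
```

Each flag occupies its own bit, so the order in which watches are combined does not matter and enabling the same watch twice has no further effect.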

Returns

dcgmReturn_t dcgmHealthSet_v2(dcgmHandle_t pDcgmHandle, dcgmHealthSetParams_v2 *params)

Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t.

Since DCGM 2.0

Parameters
  • pDcgmHandle – IN: DCGM Handle

  • params – IN: Parameters to use when setting health watches. See dcgmHealthSetParams_v2 for the description of each parameter.

Returns

dcgmReturn_t dcgmHealthGet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t *systems)

Retrieve the current state of the DCGM health check system.

Parameters
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more entities. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS to operate on all GPUs or DCGM_GROUP_ALL_NVSWITCHES to operate on all NvSwitches.

  • systems – OUT: An integer bitmask representing the systems enabled for the given group. Refer to dcgmHealthSystems_t for details.
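
The value written to systems can be tested bit by bit. The sketch below shows one way to do that. The DEMO_BIT_* constants and both helper functions are hypothetical stand-ins so that the fragment compiles on its own; with the DCGM headers, the same tests would be applied to the dcgmHealthSystems_t value filled in by dcgmHealthGet.

```c
#include <stdint.h>

/* Illustrative bit values only; real code would use the constants
 * from dcgmHealthSystems_t. */
#define DEMO_BIT_PCIE   0x1u
#define DEMO_BIT_NVLINK 0x2u
#define DEMO_BIT_MEM    0x10u

/* Return 1 if every bit of `watch` is set in `systems`, else 0. */
int demo_is_enabled(uint32_t systems, uint32_t watch)
{
    return (systems & watch) == watch;
}

/* Count how many individual watches are enabled in `systems`. */
int demo_count_enabled(uint32_t systems)
{
    int count = 0;

    while (systems != 0) {
        systems &= systems - 1; /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

For a mask of 0x11, demo_is_enabled reports the PCIe and memory stand-ins as enabled and the NVLink stand-in as disabled.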

Returns

dcgmReturn_t dcgmHealthCheck(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthResponse_t *results)

Check the configured watches for any errors/failures/warnings that have occurred since the last time this check was invoked.

On the first call, stateful information about all of the enabled watches within a group is created but no error results are provided. On subsequent calls, any error information will be returned.
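
The first-call behavior can be pictured with a toy model. The sketch below is not DCGM's implementation: toy_watch_state_t and toy_check are hypothetical names, and a single error counter stands in for whatever state the library keeps per watch. It only illustrates why the first check reports nothing, and why later checks report what has changed since the previous one.

```c
#include <stdbool.h>

/* Toy per-watch state: a model of the documented behavior only. */
typedef struct {
    bool initialized;        /* false until the first check has run */
    unsigned long last_seen; /* counter value at the previous check */
} toy_watch_state_t;

/* Return the number of new errors since the previous check.
 * The first call only records a baseline and reports 0. */
unsigned long toy_check(toy_watch_state_t *state, unsigned long error_counter)
{
    unsigned long fresh = 0;

    if (state->initialized) {
        fresh = error_counter - state->last_seen;
    } else {
        state->initialized = true;
    }
    state->last_seen = error_counter;
    return fresh;
}
```

In this model, a caller that needs results from its first meaningful check would issue one initial call right after enabling the watches and discard its output.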

Parameters
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more entities. Refer to dcgmGroupCreate for details on creating a group.

  • results – OUT: A reference to the dcgmHealthResponse_t structure to populate. results->version must be set to dcgmHealthResponse_version.
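
The version requirement on results follows a common pattern for versioned structures: zero the structure, set the version field, then make the call. The sketch below illustrates that pattern with stand-ins so that it compiles without the DCGM headers. demo_health_response_t, its fields, DEMO_HEALTH_RESPONSE_VERSION, and both functions are hypothetical, and the error code for a version mismatch is an assumption. With the real header, the equivalent step is setting results->version to dcgmHealthResponse_version before calling dcgmHealthCheck.

```c
#include <string.h>

/* Hypothetical stand-in for dcgmHealthResponse_t. */
typedef struct {
    unsigned int version;       /* must be filled in by the caller */
    unsigned int incidentCount; /* illustrative output field */
} demo_health_response_t;

/* Placeholder standing in for dcgmHealthResponse_version. */
#define DEMO_HEALTH_RESPONSE_VERSION 1u

/* Zero the structure and stamp it with the expected version. */
void demo_prepare_response(demo_health_response_t *response)
{
    memset(response, 0, sizeof(*response));
    response->version = DEMO_HEALTH_RESPONSE_VERSION;
}

/* Mock check: reject a missing or unversioned structure (assumed
 * behavior), otherwise populate it and report success with 0. */
int demo_health_check(demo_health_response_t *response)
{
    if (response == NULL || response->version != DEMO_HEALTH_RESPONSE_VERSION) {
        return -1;
    }
    response->incidentCount = 0;
    return 0;
}
```

Zeroing the structure first means that any field the call leaves untouched holds a known value.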

Returns