Health Monitor

group Health Monitor

This chapter describes the methods that handle the GPU health monitor.

Functions

dcgmReturn_t dcgmHealthSet(
dcgmHandle_t pDcgmHandle,
dcgmGpuGrp_t groupId,
dcgmHealthSystems_t systems
)

Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more entities. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS to operate on all GPUs or DCGM_GROUP_ALL_NVSWITCHES to operate on all NvSwitches.

  • systems – IN: A bitmask of the systems to enable for health checks, formed by logically OR’ing dcgmHealthSystems_t values together. Refer to dcgmHealthSystems_t for details.
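
As an illustration of how the systems argument is put together, the following minimal sketch combines several health systems into a single mask with bitwise OR. The exHealthSystems_t type, its EX_HEALTH_WATCH_* members, their numeric values, and ex_build_mask are hypothetical stand-ins so the example compiles without the DCGM headers; real code would use the dcgmHealthSystems_t constants and pass the result to dcgmHealthSet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for dcgmHealthSystems_t values. The real names
 * and numeric values are defined by the DCGM headers, not here. */
typedef enum {
    EX_HEALTH_WATCH_PCIE    = 0x1,
    EX_HEALTH_WATCH_MEM     = 0x2,
    EX_HEALTH_WATCH_THERMAL = 0x4,
    EX_HEALTH_WATCH_POWER   = 0x8
} exHealthSystems_t;

/* Combine 'count' individual systems into one mask using bitwise OR.
 * An empty list yields 0, meaning no system is selected. */
uint32_t ex_build_mask(const exHealthSystems_t *systems, size_t count)
{
    uint32_t mask = 0;

    assert(systems != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        mask |= (uint32_t)systems[i];
    }
    return mask;
}
```

Because each system occupies its own bit in this sketch, listing the same system twice leaves the mask unchanged.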

Returns:

dcgmReturn_t dcgmHealthSet_v2(
dcgmHandle_t pDcgmHandle,
dcgmHealthSetParams_v2 *params
)

Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t.

Since DCGM 2.0

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • params – IN: Parameters to use when setting health watches. See dcgmHealthSetParams_v2 for the description of each parameter.

Returns:

dcgmReturn_t dcgmHealthGet(
dcgmHandle_t pDcgmHandle,
dcgmGpuGrp_t groupId,
dcgmHealthSystems_t *systems
)

Retrieve the current state of the DCGM health check system.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more entities. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS to operate on all GPUs or DCGM_GROUP_ALL_NVSWITCHES to operate on all NvSwitches.

  • systems – OUT: An integer representing the enabled systems for the given group. Refer to dcgmHealthSystems_t for details.
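
The value written to systems can be read back one bit at a time. The sketch below shows one way to test and list the enabled systems in such a mask. The EX_HEALTH_WATCH_* constants, their bit positions, and the ex_* helpers are hypothetical stand-ins used to keep the example self-contained; real code would test the mask against the dcgmHealthSystems_t constants.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit assignments; the real ones come from dcgmHealthSystems_t. */
#define EX_HEALTH_WATCH_PCIE    0x1u
#define EX_HEALTH_WATCH_MEM     0x2u
#define EX_HEALTH_WATCH_THERMAL 0x4u
#define EX_HEALTH_WATCH_POWER   0x8u

typedef struct {
    uint32_t    bit;   /* single-bit value identifying the system */
    const char *name;  /* label used when printing */
} exSystemName_t;

static const exSystemName_t EX_SYSTEMS[] = {
    { EX_HEALTH_WATCH_PCIE,    "PCIe"    },
    { EX_HEALTH_WATCH_MEM,     "Memory"  },
    { EX_HEALTH_WATCH_THERMAL, "Thermal" },
    { EX_HEALTH_WATCH_POWER,   "Power"   },
};

/* Return non-zero when 'system' is set in 'mask'. */
int ex_is_enabled(uint32_t mask, uint32_t system)
{
    return (mask & system) != 0;
}

/* Print the name of every enabled system and return how many were found. */
int ex_print_enabled(uint32_t mask)
{
    int found = 0;

    for (size_t i = 0; i < sizeof(EX_SYSTEMS) / sizeof(EX_SYSTEMS[0]); i++) {
        if (ex_is_enabled(mask, EX_SYSTEMS[i].bit)) {
            printf("enabled: %s\n", EX_SYSTEMS[i].name);
            found++;
        }
    }
    return found;
}
```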

Returns:

dcgmReturn_t dcgmHealthCheck(
dcgmHandle_t pDcgmHandle,
dcgmGpuGrp_t groupId,
dcgmHealthResponse_t *results
)

Check the configured watches for any errors/failures/warnings that have occurred since the last time this check was invoked.

On the first call, stateful information about all of the enabled watches within a group is created but no error results are provided. On subsequent calls, any error information will be returned.
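
The first-call behavior described above can be pictured with a toy model. This is only a sketch of the described semantics, not DCGM's implementation: exWatch_t, ex_record_incident, and ex_check are hypothetical names, and a single counter stands in for whatever state the real monitor keeps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one watch. All names here are hypothetical. */
typedef struct {
    unsigned long total_incidents; /* running count kept by the monitor */
    unsigned long last_reported;   /* count at the time of the previous check */
    bool          has_baseline;    /* false until the first check has run */
} exWatch_t;

/* Simulate the monitor observing one incident. */
void ex_record_incident(exWatch_t *w)
{
    assert(w != NULL);
    w->total_incidents++;
}

/* Return the number of incidents seen since the previous check.
 * The first call only establishes the baseline and reports nothing. */
unsigned long ex_check(exWatch_t *w)
{
    unsigned long fresh = 0;

    assert(w != NULL);
    if (w->has_baseline) {
        fresh = w->total_incidents - w->last_reported;
    }
    w->has_baseline  = true;
    w->last_reported = w->total_incidents;
    return fresh;
}
```

In this model, an incident recorded before the first check is absorbed into the baseline, and each later check reports only what arrived after the previous one.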

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more entities. Refer to dcgmGroupCreate for details on creating a group.

  • results – OUT: A reference to the dcgmHealthResponse_t structure to populate. results->version must be set to dcgmHealthResponse_version.
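
The requirement to set results->version before the call follows a common versioned-struct pattern, sketched below with stand-in types. Everything prefixed ex or EX_ is hypothetical, including the status codes and the version encoding (struct size combined with a version number), which is an assumption made for illustration; real code would assign dcgmHealthResponse_version to the version field of a dcgmHealthResponse_t.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical response structure with a caller-supplied version field. */
typedef struct {
    unsigned int version;       /* must be set by the caller before the call */
    unsigned int incidentCount; /* filled in by the callee */
} exHealthResponse_t;

/* Assumed encoding: struct size in the low bits, version number in the top byte. */
#define EX_MAKE_VERSION(type, ver) \
    ((unsigned int)(sizeof(type) | ((unsigned long)(ver) << 24U)))
#define exHealthResponse_version EX_MAKE_VERSION(exHealthResponse_t, 1)

/* Hypothetical status codes for this sketch only. */
#define EX_ST_OK            0
#define EX_ST_BADPARAM     (-1)
#define EX_ST_VER_MISMATCH (-2)

/* Zero the structure, then stamp the version the callee expects. */
void ex_init_response(exHealthResponse_t *r)
{
    memset(r, 0, sizeof(*r));
    r->version = exHealthResponse_version;
}

/* Stand-in for a health check that rejects unversioned structures. */
int ex_health_check(exHealthResponse_t *results)
{
    if (results == NULL) {
        return EX_ST_BADPARAM;
    }
    if (results->version != exHealthResponse_version) {
        return EX_ST_VER_MISMATCH;
    }
    results->incidentCount = 0; /* nothing to report in this sketch */
    return EX_ST_OK;
}
```

Checking the version up front lets the callee detect a caller built against a different structure layout before it writes any output.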

Returns: