1.7. Health Monitor

This chapter describes the functions that configure and query the GPU health monitor.

Functions

dcgmReturn_t dcgmHealthCheck ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthResponse_t* results )
dcgmReturn_t dcgmHealthGet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t* systems )
dcgmReturn_t dcgmHealthSet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t systems )
dcgmReturn_t dcgmHealthSet_v2 ( dcgmHandle_t pDcgmHandle, dcgmHealthSetParams_v2* params )

Functions

dcgmReturn_t dcgmHealthCheck ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthResponse_t* results )
Parameters
pDcgmHandle
IN: DCGM Handle
groupId
IN: Group ID representing a collection of one or more entities. Refer to dcgmGroupCreate for details on creating a group
results
OUT: A reference to the dcgmHealthResponse_t structure to populate. results->version must be set to dcgmHealthResponse_version.
Returns

Description

Check the configured watches for any errors, failures, or warnings that have occurred since the last time this check was invoked. The first call creates stateful information about all of the enabled watches within the group but returns no error results. Subsequent calls return any error information recorded since the previous call.
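The first-call behavior described above can be made explicit on the caller's side. The following is a minimal sketch, not DCGM code: demo_check_fn is a hypothetical callback standing in for whatever code calls dcgmHealthCheck and counts the incidents it reports, and demo_health_poller, demo_poll and demo_fixed_check are illustrative names that do not exist in DCGM. The wrapper treats the first successful call as a priming call and reports zero incidents for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback standing in for a health check on one group.
   Returns 0 on success and writes the number of incidents found. */
typedef int (*demo_check_fn)(void *ctx, unsigned *incident_count);

typedef struct {
    demo_check_fn check; /* the underlying check */
    void *ctx;           /* opaque state passed to the check */
    bool primed;         /* true once the first successful call is done */
} demo_health_poller;

/* Runs the check. The first successful call only primes the poller and
   reports zero incidents; later calls pass the count through. */
int demo_poll(demo_health_poller *p, unsigned *incident_count)
{
    assert(p != NULL && p->check != NULL && incident_count != NULL);

    unsigned raw = 0;
    int status = p->check(p->ctx, &raw);
    if (status != 0) {
        *incident_count = 0;
        return status; /* a failed call does not prime the poller */
    }
    if (!p->primed) {
        p->primed = true;
        *incident_count = 0;
        return 0;
    }
    *incident_count = raw;
    return 0;
}

/* Test double: reports the counter that ctx points to. */
int demo_fixed_check(void *ctx, unsigned *incident_count)
{
    *incident_count = *(unsigned *)ctx;
    return 0;
}
```

In a real program the callback would wrap the actual dcgmHealthCheck call for a group; the test double here only exists so the sketch can be exercised without a GPU.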

dcgmReturn_t dcgmHealthGet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t* systems )
Parameters
pDcgmHandle
IN: DCGM Handle
groupId
IN: Group ID representing a collection of one or more entities. Refer to dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS to operate on all GPUs or DCGM_GROUP_ALL_NVSWITCHES to operate on all NvSwitches.
systems
OUT: An integer bitmask representing the systems enabled for the given group. Refer to dcgmHealthSystems_t for details.
Returns

Description

Retrieve the current state of the DCGM health check system.

dcgmReturn_t dcgmHealthSet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t systems )
Parameters
pDcgmHandle
IN: DCGM Handle
groupId
IN: Group ID representing a collection of one or more entities. Refer to dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS to operate on all GPUs or DCGM_GROUP_ALL_NVSWITCHES to operate on all NvSwitches.
systems
IN: The systems to enable for health checks, given as dcgmHealthSystems_t values logically OR'd together. Refer to dcgmHealthSystems_t for details.
Returns

Description

Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t.
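The systems argument is a set of flags OR'd together, and the value read back by dcgmHealthGet can be tested the same way. The sketch below shows that bitmask pattern in isolation. The demo_watch enum, its members and its values are hypothetical stand-ins chosen for illustration; real code should use the constants that dcgmHealthSystems_t defines in the DCGM headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for dcgmHealthSystems_t members. Names and
   values are illustrative only, not the real DCGM constants. */
enum demo_watch {
    DEMO_WATCH_PCIE    = 1 << 0,
    DEMO_WATCH_MEM     = 1 << 1,
    DEMO_WATCH_THERMAL = 1 << 2,
    DEMO_WATCH_POWER   = 1 << 3
};

/* OR the requested systems together into a single mask, as one would
   before passing the result as the systems argument. */
uint32_t demo_build_mask(const enum demo_watch *systems, int count)
{
    assert(count == 0 || systems != NULL);

    uint32_t mask = 0;
    for (int i = 0; i < count; i++) {
        mask |= (uint32_t)systems[i];
    }
    return mask;
}

/* True when every bit of `system` is set in `mask`, as one would check
   on a mask read back from the health check system. */
bool demo_mask_has(uint32_t mask, enum demo_watch system)
{
    return (mask & (uint32_t)system) == (uint32_t)system;
}
```

With these stand-in values, building a mask from DEMO_WATCH_PCIE and DEMO_WATCH_THERMAL yields a value for which demo_mask_has is true for those two systems and false for the others.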

dcgmReturn_t dcgmHealthSet_v2 ( dcgmHandle_t pDcgmHandle, dcgmHealthSetParams_v2* params )
Parameters
pDcgmHandle
IN: DCGM Handle
params
Returns

Description

Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t.

Since DCGM 2.0