1.7. Health Monitor
This chapter describes the methods that handle the GPU health monitor.
Functions
- dcgmReturn_t dcgmHealthCheck ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthResponse_t* results )
- dcgmReturn_t dcgmHealthGet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t* systems )
- dcgmReturn_t dcgmHealthSet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t systems )
- dcgmReturn_t dcgmHealthSet_v2 ( dcgmHandle_t pDcgmHandle, dcgmHealthSetParams_v2* params )
Functions
- dcgmReturn_t dcgmHealthCheck ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthResponse_t* results )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- groupId
- IN: Group ID representing a collection of one or more entities. Refer to dcgmGroupCreate for details on creating a group
- results
- OUT: A reference to the dcgmHealthResponse_t structure to populate. results->version must be set to dcgmHealthResponse_version.
Returns
- DCGM_ST_OK if the call was successful
- DCGM_ST_BADPARAM if a parameter is invalid
- DCGM_ST_VER_MISMATCH if results->version is not dcgmHealthResponse_version
Description
Check the configured watches for any errors, failures, or warnings that have occurred since the last time this check was invoked. The first call creates stateful information about all of the enabled watches within the group but provides no error results. Subsequent calls return any error information recorded since the previous call.
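The calling pattern implied by the description above (set the version field, make one priming call, then poll) can be sketched with a small mock. Everything prefixed with `example_` or `EXAMPLE_` is a hypothetical stand-in written for this illustration, not a DCGM definition. The mock only imitates the documented behaviour and assumes that errors seen before the first call are discarded.

```c
#include <stddef.h>

/* Hypothetical stand-ins; none of these names or values come from DCGM. */
#define EXAMPLE_RESPONSE_VERSION 1u

typedef enum {
    EXAMPLE_ST_OK = 0,
    EXAMPLE_ST_BADPARAM,
    EXAMPLE_ST_VER_MISMATCH
} example_status_t;

typedef struct {
    unsigned int version;       /* caller must set this before the call */
    unsigned int incidentCount; /* filled in by the check */
} example_response_t;

typedef struct {
    int primed;                 /* has the first (baseline) call happened? */
    unsigned int pendingErrors; /* errors recorded since the last check */
} example_monitor_t;

/* Mock of the documented semantics: validate the arguments and the version,
 * report nothing on the first call, then report errors since the last call. */
example_status_t example_health_check(example_monitor_t *mon,
                                      example_response_t *results)
{
    if (mon == NULL || results == NULL)
        return EXAMPLE_ST_BADPARAM;
    if (results->version != EXAMPLE_RESPONSE_VERSION)
        return EXAMPLE_ST_VER_MISMATCH;

    if (!mon->primed) {
        /* First call: create the baseline state, provide no error results. */
        mon->primed = 1;
        results->incidentCount = 0;
    } else {
        results->incidentCount = mon->pendingErrors;
    }

    /* Assumption: each successful check resets the "since last check" window. */
    mon->pendingErrors = 0;
    return EXAMPLE_ST_OK;
}
```

A caller following this pattern would treat the first successful call as a baseline and act only on the results of later calls.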
- dcgmReturn_t dcgmHealthGet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t* systems )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- groupId
- IN: Group ID representing a collection of one or more entities. Refer to dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs, or DCGM_GROUP_ALL_NVSWITCHES to perform the operation on all NvSwitches.
- systems
- OUT: An integer representing the enabled systems for the given group. Refer to dcgmHealthSystems_t for details.
Description
Retrieve the current state of the DCGM health check system.
- dcgmReturn_t dcgmHealthSet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t systems )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- groupId
- IN: Group ID representing a collection of one or more entities. Refer to dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs, or DCGM_GROUP_ALL_NVSWITCHES to perform the operation on all NvSwitches.
- systems
- IN: The systems that should be enabled for health checks, given as dcgmHealthSystems_t enum values logically OR'd together. Refer to dcgmHealthSystems_t for details.
Description
Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t.
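Because the systems argument is a set of enum values OR'd together, the value passed to dcgmHealthSet and the value returned by dcgmHealthGet can both be treated as a bitmask. The sketch below shows that idea with a hypothetical stand-in enum. The `EXAMPLE_WATCH_*` names and their values are assumptions made for this illustration; real code should use the dcgmHealthSystems_t constants from the DCGM headers.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for dcgmHealthSystems_t: one bit per subsystem.
 * Names and values are illustrative assumptions, not DCGM definitions. */
typedef enum {
    EXAMPLE_WATCH_PCIE    = 0x1,
    EXAMPLE_WATCH_NVLINK  = 0x2,
    EXAMPLE_WATCH_MEM     = 0x4,
    EXAMPLE_WATCH_THERMAL = 0x8
} example_health_systems_t;

/* Build a value suitable for an IN "systems" argument: PCIe plus memory. */
example_health_systems_t example_compose(void)
{
    return (example_health_systems_t)(EXAMPLE_WATCH_PCIE | EXAMPLE_WATCH_MEM);
}

/* Test whether one subsystem bit is set in an OUT "systems" value. */
int example_is_enabled(example_health_systems_t systems,
                       example_health_systems_t which)
{
    return (systems & which) != 0;
}

/* Print the name of every enabled subsystem and return how many are set. */
int example_print_enabled(example_health_systems_t systems)
{
    static const struct {
        example_health_systems_t bit;
        const char *name;
    } table[] = {
        { EXAMPLE_WATCH_PCIE,    "PCIe"    },
        { EXAMPLE_WATCH_NVLINK,  "NVLink"  },
        { EXAMPLE_WATCH_MEM,     "memory"  },
        { EXAMPLE_WATCH_THERMAL, "thermal" }
    };
    int count = 0;

    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i) {
        if (example_is_enabled(systems, table[i].bit)) {
            printf("enabled: %s\n", table[i].name);
            ++count;
        }
    }
    return count;
}
```

Keeping one bit per subsystem means several watches can be enabled in a single call, and a caller can test for any one of them without knowing which others are set.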
- dcgmReturn_t dcgmHealthSet_v2 ( dcgmHandle_t pDcgmHandle, dcgmHealthSetParams_v2* params )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- params
Description
Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t.
Since DCGM 2.0