Health Monitor
- group DCGMAPI_HM
This chapter describes the methods that handle the GPU health monitor.
Functions
-
dcgmReturn_t dcgmHealthSet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t systems)
Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more entities. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS to operate on all GPUs or DCGM_GROUP_ALL_NVSWITCHES to operate on all NvSwitches.
systems – IN: A bitmask of the systems to enable for health checks, formed by logically OR'ing together values of dcgmHealthSystems_t. Refer to dcgmHealthSystems_t for details.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if a parameter is invalid
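Because systems is a bitmask, several subsystems can be enabled in one call by OR'ing their flags together. The following is a minimal sketch of that composition only. The DEMO_ names and bit values are stand-ins invented for illustration; they are not the constants of dcgmHealthSystems_t, which should be taken from the DCGM headers.

```c
#include <assert.h>

/* Stand-in flags for illustration only. Names and values are assumptions,
 * not the real dcgmHealthSystems_t constants. */
enum {
    DEMO_WATCH_PCIE    = 1u << 0,
    DEMO_WATCH_MEM     = 1u << 1,
    DEMO_WATCH_THERMAL = 1u << 2
};

/* Return the mask with the given flag added. */
unsigned int demo_enable(unsigned int mask, unsigned int flag)
{
    return mask | flag;
}

/* Return 1 if every bit of flag is set in mask, 0 otherwise. */
int demo_is_enabled(unsigned int mask, unsigned int flag)
{
    assert(flag != 0u);
    return (mask & flag) == flag;
}
```

A caller would build the mask first, for example demo_enable(DEMO_WATCH_PCIE, DEMO_WATCH_THERMAL), and pass the result as the systems argument.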
-
dcgmReturn_t dcgmHealthSet_v2(dcgmHandle_t pDcgmHandle, dcgmHealthSetParams_v2 *params)
Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t.
Since DCGM 2.0
- Parameters:
pDcgmHandle – IN: DCGM Handle
params – IN: Parameters to use when setting health watches. See dcgmHealthSetParams_v2 for the description of each parameter.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if a parameter is invalid
-
dcgmReturn_t dcgmHealthGet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthSystems_t *systems)
Retrieve the current state of the DCGM health check system.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more entities. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS to operate on all GPUs or DCGM_GROUP_ALL_NVSWITCHES to operate on all NvSwitches.
systems – OUT: An integer representing the enabled systems for the given group. Refer to dcgmHealthSystems_t for details.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if a parameter is invalid
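The enabled-systems value is written through the systems pointer, so the caller typically decodes it bit by bit afterwards. The sketch below shows one way that decoding might look. demo_health_get is a hypothetical stand-in that only mimics the out-parameter shape and does not call DCGM, and the DEMO_ flags and their names are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in flags; names and values are assumptions for illustration. */
enum {
    DEMO_WATCH_PCIE    = 1u << 0,
    DEMO_WATCH_MEM     = 1u << 1,
    DEMO_WATCH_THERMAL = 1u << 2
};

typedef struct {
    unsigned int flag;
    const char  *name;
} demoFlagName_t;

static const demoFlagName_t demo_names[] = {
    { DEMO_WATCH_PCIE,    "pcie"    },
    { DEMO_WATCH_MEM,     "mem"     },
    { DEMO_WATCH_THERMAL, "thermal" }
};

/* Hypothetical stand-in for a "get" call: reports a fixed mask through the
 * out-parameter. Returns 0 on success, -1 if systems is NULL. */
int demo_health_get(unsigned int *systems)
{
    if (systems == NULL) {
        return -1;
    }
    *systems = DEMO_WATCH_PCIE | DEMO_WATCH_THERMAL;
    return 0;
}

/* Write a comma-separated list of the enabled systems into buf and return
 * how many names were written. Stops early if buf is too small. */
int demo_describe(unsigned int mask, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    int count = 0;
    size_t used = 0;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(demo_names) / sizeof(demo_names[0]); ++i) {
        if ((mask & demo_names[i].flag) == 0) {
            continue;
        }
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? "," : "", demo_names[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0'; /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        ++count;
    }
    return count;
}
```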
-
dcgmReturn_t dcgmHealthCheck(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmHealthResponse_t *results)
Check the configured watches for any errors/failures/warnings that have occurred since the last time this check was invoked.
On the first call, stateful information about all of the enabled watches within a group is created but no error results are provided. On subsequent calls, any error information will be returned.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more entities. Refer to dcgmGroupCreate for details on creating a group.
results – OUT: A reference to the dcgmHealthResponse_t structure to populate. results->version must be set to dcgmHealthResponse_version.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if a parameter is invalid
DCGM_ST_VER_MISMATCH if results->version is not dcgmHealthResponse_version
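The calling pattern described above (set the version field first, expect nothing from the first check, then receive only what occurred since the previous check) can be modelled with a small toy. This is a sketch of the documented behaviour, not DCGM's implementation: every demo identifier, the status values, and the version constant are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative status codes; the real ones are dcgmReturn_t values. */
typedef enum {
    DEMO_ST_OK           = 0,
    DEMO_ST_BADPARAM     = -1,
    DEMO_ST_VER_MISMATCH = -2
} demoReturn_t;

/* Hypothetical version tag the caller must place in the response struct. */
#define DEMO_RESPONSE_VERSION 1u

typedef struct {
    unsigned int version;       /* set by the caller before the call */
    unsigned int incidentCount; /* filled in by the check */
} demoHealthResponse_t;

/* Toy monitor state. */
typedef struct {
    int          primed;  /* 0 until the first successful check */
    unsigned int pending; /* incidents recorded since the last check */
} demoMonitor_t;

/* Record one incident against the monitor. */
void demo_record_incident(demoMonitor_t *m)
{
    assert(m != NULL);
    m->pending++;
}

/* Validate the arguments, then report incidents since the previous check.
 * The first successful call only initialises state and reports nothing. */
demoReturn_t demo_health_check(demoMonitor_t *m, demoHealthResponse_t *results)
{
    if (m == NULL || results == NULL) {
        return DEMO_ST_BADPARAM;
    }
    if (results->version != DEMO_RESPONSE_VERSION) {
        return DEMO_ST_VER_MISMATCH;
    }
    if (!m->primed) {
        m->primed = 1;
        results->incidentCount = 0;
    } else {
        results->incidentCount = m->pending;
    }
    m->pending = 0; /* each check starts a new reporting window */
    return DEMO_ST_OK;
}
```

In this model a caller that forgets to set results->version gets DEMO_ST_VER_MISMATCH before any state is touched, which mirrors why the version field is filled in before the call.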