2.8.1. Setup and Management
[Policies]
Describes APIs for setting up policies and registering callbacks to receive notifications when a specific policy condition has been violated.
Functions
- dcgmReturn_t dcgmPolicyGet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, int count, dcgmPolicy_t* policy, dcgmStatus_t statusHandle )
- dcgmReturn_t dcgmPolicyRegister ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition, fpRecvUpdates beginCallback, fpRecvUpdates finishCallback )
- dcgmReturn_t dcgmPolicySet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicy_t* policy, dcgmStatus_t statusHandle )
- dcgmReturn_t dcgmPolicyUnregister ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition )
Functions
- dcgmReturn_t dcgmPolicyGet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, int count, dcgmPolicy_t* policy, dcgmStatus_t statusHandle )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- groupId
- IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
- count
- IN: The size of the policy array. This is the maximum number of policies that will be retrieved and ultimately should correspond to the number of GPUs specified in the group.
- policy
- OUT: A reference to dcgmPolicy_t that will be used as storage for the current policies applied to each GPU in the group.
- statusHandle
- IN/OUT: Resulting status for the operation. Pass it as NULL if the detailed error information for the operation is not needed. Refer to dcgmStatusCreate for details on creating a status handle.
Returns
- DCGM_ST_OK if the call was successful
- DCGM_ST_BADPARAM if groupId or policy is invalid
- DCGM_ST_* if a different error has occurred; the error is stored in statusHandle. Refer to dcgmReturn_t
Description
Get the current violation policy inside the policy manager. Given a groupId, a number of policy structures are retrieved.
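The relationship between count and the policy array can be sketched in plain C. This is only an illustration of the "at most count entries" rule described above, not DCGM code: demo_policy_t, demo_policy_get, and the DEMO_* return codes are hypothetical stand-ins, and the real dcgmPolicy_t carries more than a condition mask.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in; this is NOT the DCGM dcgmPolicy_t type. */
typedef struct {
    unsigned int condition_mask; /* OR'd condition bits for one GPU */
} demo_policy_t;

/* Hypothetical return codes standing in for dcgmReturn_t values. */
enum { DEMO_OK = 0, DEMO_BADPARAM = -1 };

/*
 * Copy at most `count` per-GPU policies from `stored` into `out`.
 * `*copied` receives the number of entries actually written, which is
 * the smaller of `count` and `gpus_in_group`.
 */
int demo_policy_get(const demo_policy_t *stored, int gpus_in_group,
                    int count, demo_policy_t *out, int *copied)
{
    if (stored == NULL || out == NULL || copied == NULL ||
        gpus_in_group < 0 || count <= 0) {
        return DEMO_BADPARAM;
    }

    int n = (count < gpus_in_group) ? count : gpus_in_group;
    assert(n <= count); /* never write past the caller's array */

    for (int i = 0; i < n; i++) {
        out[i] = stored[i];
    }
    *copied = n;
    return DEMO_OK;
}
```

With a three-GPU group and a two-element array, only the first two policies are copied, which is why count should match the number of GPUs in the group.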
- dcgmReturn_t dcgmPolicyRegister ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition, fpRecvUpdates beginCallback, fpRecvUpdates finishCallback )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- groupId
- IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
- condition
- IN: The set of conditions specified as an OR'd list (see dcgmPolicyCondition_t) for which to register a callback function
- beginCallback
- IN: A reference to a function that should be called should a violation occur. This function will be called before any actions specified by the policy are taken.
- finishCallback
- IN: A reference to a function that should be called should a violation occur. This function will be called after any actions specified by the policy are completed.
Returns
- DCGM_ST_OK if the call was successful
- DCGM_ST_BADPARAM if groupId or condition is invalid, or if beginCallback or finishCallback is NULL
- DCGM_ST_NOT_SUPPORTED if any non-Tesla GPUs are part of the GPU group specified in groupId
Description
Register a function to be called when a specific policy condition (see dcgmPolicyCondition_t) has been violated. The callback(s) will be called automatically when in DCGM_OPERATION_MODE_AUTO mode, and only after dcgmPolicyTrigger when in DCGM_OPERATION_MODE_MANUAL mode. All callbacks are made within a separate thread.
This API is only supported on Tesla GPUs and will return DCGM_ST_NOT_SUPPORTED if any non-Tesla GPUs are part of the GPU group specified in groupId.
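The ordering of beginCallback, the policy action, and finishCallback, along with the OR'd condition mask, can be illustrated with a small self-contained sketch. Everything here is a hypothetical stand-in (the DEMO_COND_* bits, demo_callback_t, demo_register, demo_violation); none of it is the DCGM API, and the sketch runs callbacks synchronously rather than in a separate thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical condition bits; names and values are NOT taken from DCGM. */
enum { DEMO_COND_A = 0x1, DEMO_COND_B = 0x2, DEMO_COND_C = 0x4 };

/* Hypothetical return codes standing in for dcgmReturn_t values. */
enum { DEMO_OK = 0, DEMO_BADPARAM = -1 };

typedef int (*demo_callback_t)(void *data);

typedef struct {
    unsigned int    mask;      /* OR'd conditions the callbacks cover */
    demo_callback_t on_begin;  /* runs before the policy action */
    demo_callback_t on_finish; /* runs after the policy action */
} demo_registration_t;

/* Small trace buffer so the call order can be inspected afterwards. */
typedef struct { char log[8]; int len; } demo_trace_t;

static void trace_push(demo_trace_t *t, char c)
{
    if (t->len < (int)sizeof(t->log) - 1) {
        t->log[t->len++] = c;
        t->log[t->len] = '\0';
    }
}

int  demo_on_begin(void *data)  { trace_push(data, 'B'); return 0; }
void demo_action(void *data)    { trace_push(data, 'A'); }
int  demo_on_finish(void *data) { trace_push(data, 'F'); return 0; }

/* Store the callbacks for the OR'd condition mask. */
int demo_register(demo_registration_t *reg, unsigned int mask,
                  demo_callback_t on_begin, demo_callback_t on_finish)
{
    if (reg == NULL || mask == 0 || on_begin == NULL || on_finish == NULL) {
        return DEMO_BADPARAM;
    }
    reg->mask = mask;
    reg->on_begin = on_begin;
    reg->on_finish = on_finish;
    return DEMO_OK;
}

/*
 * Simulate one violation: begin callback, then the action, then the
 * finish callback. Returns 1 if the callbacks fired, 0 if the violated
 * condition is not part of the registered mask.
 */
int demo_violation(const demo_registration_t *reg, unsigned int violated,
                   void (*action)(void *), void *data)
{
    assert(reg != NULL);
    if ((reg->mask & violated) == 0) {
        return 0;
    }
    reg->on_begin(data);
    if (action != NULL) {
        action(data);
    }
    reg->on_finish(data);
    return 1;
}
```

Registering with DEMO_COND_A | DEMO_COND_C and then simulating a DEMO_COND_C violation leaves the trace reading begin, action, finish in that order; a DEMO_COND_B violation fires nothing because it is not in the mask.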
- dcgmReturn_t dcgmPolicySet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicy_t* policy, dcgmStatus_t statusHandle )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- groupId
- IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
- policy
- IN: A reference to dcgmPolicy_t that will be applied to all GPUs in the group.
- statusHandle
- IN/OUT: Resulting status for the operation. Pass it as NULL if the detailed error information is not needed. Refer to dcgmStatusCreate for details on creating a status handle.
Returns
- DCGM_ST_OK if the call was successful
- DCGM_ST_BADPARAM if groupId or policy is invalid
- DCGM_ST_NOT_SUPPORTED if any non-Tesla GPUs are part of the GPU group specified in groupId
- DCGM_ST_* if a different error has occurred; the error is stored in statusHandle. Refer to dcgmReturn_t
Description
Set the current violation policy inside the policy manager. Given the conditions within the dcgmPolicy_t structure, if a violation has occurred, subsequent action(s) may be performed to either report or contain the failure.
This API is only supported on Tesla GPUs and will return DCGM_ST_NOT_SUPPORTED if any non-Tesla GPUs are part of the GPU group specified in groupId.
- dcgmReturn_t dcgmPolicyUnregister ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- groupId
- IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
- condition
- IN: The set of conditions specified as an OR'd list (see dcgmPolicyCondition_t) for which to unregister a callback function
Returns
- DCGM_ST_OK if the call was successful
- DCGM_ST_BADPARAM if groupId or condition is invalid, or if callback is NULL
Description
Unregister a function to be called for a specific policy condition (see dcgmPolicyCondition_t). This function will unregister all callbacks for a given condition and handle.
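The effect of passing an OR'd condition list to an unregister call can be sketched as clearing every callback slot whose bit is set in the mask. This is a minimal illustration under assumed names: demo_table_t, demo_unregister, DEMO_NUM_CONDS, and the DEMO_* return codes are hypothetical and say nothing about how DCGM stores callbacks internally.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_CONDS 3 /* hypothetical number of condition bits */

/* Hypothetical return codes standing in for dcgmReturn_t values. */
enum { DEMO_OK = 0, DEMO_BADPARAM = -1 };

typedef int (*demo_callback_t)(void *data);

/* One begin/finish callback slot per condition bit. */
typedef struct {
    demo_callback_t on_begin[DEMO_NUM_CONDS];
    demo_callback_t on_finish[DEMO_NUM_CONDS];
} demo_table_t;

/* Placeholder callback used to populate the table. */
int demo_noop(void *data) { (void)data; return 0; }

/* Clear both callbacks for every condition bit set in `mask`. */
int demo_unregister(demo_table_t *table, unsigned int mask)
{
    /* The shifts below must stay within the width of unsigned int. */
    assert(DEMO_NUM_CONDS > 0 && DEMO_NUM_CONDS < 32);

    const unsigned int valid = (1u << DEMO_NUM_CONDS) - 1u;
    if (table == NULL || mask == 0 || (mask & ~valid) != 0) {
        return DEMO_BADPARAM;
    }
    for (int bit = 0; bit < DEMO_NUM_CONDS; bit++) {
        if (mask & (1u << bit)) {
            table->on_begin[bit] = NULL;
            table->on_finish[bit] = NULL;
        }
    }
    return DEMO_OK;
}
```

Unregistering with a mask of 0x1 | 0x4 clears the first and third slots and leaves the second untouched, mirroring how a single call can cover several conditions at once.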