1.8.1. Setup and Management

[Policies]

Describes the APIs for setting up policies and registering callbacks that are notified when a specific policy condition has been violated.

Functions

dcgmReturn_t dcgmPolicyGet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, int count, dcgmPolicy_t* policy, dcgmStatus_t statusHandle )
dcgmReturn_t dcgmPolicyRegister ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition, fpRecvUpdates beginCallback, fpRecvUpdates finishCallback )
dcgmReturn_t dcgmPolicySet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicy_t* policy, dcgmStatus_t statusHandle )
dcgmReturn_t dcgmPolicyUnregister ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition )

Functions

dcgmReturn_t dcgmPolicyGet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, int count, dcgmPolicy_t* policy, dcgmStatus_t statusHandle )
Parameters
pDcgmHandle
IN: DCGM Handle
groupId
IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
count
IN: The size of the policy array. This is the maximum number of policies that will be retrieved, and it should correspond to the number of GPUs in the group.
policy
OUT: A reference to a dcgmPolicy_t array that will be used as storage for the current policies applied to each GPU in the group.
statusHandle
IN/OUT: Resulting status for the operation. Pass NULL if detailed error information for the operation is not needed. Refer to dcgmStatusCreate for details on creating a status handle.
Returns

Description

Get the current violation policy inside the policy manager. Given a groupId, up to count policy structures are retrieved, one for each GPU in the group.
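
The relationship between count and the policy array can be sketched as follows. This is a minimal illustration, not DCGM's own sample code: the typedefs, the simplified dcgmPolicy_t layout, the GROUP_GPU_COUNT constant, the count_active_policies helper, and the stub body of dcgmPolicyGet are all assumptions that let the sketch compile without the DCGM headers. Only the function signature is taken from this section.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins (assumptions) so the sketch compiles without the DCGM headers. */
typedef uintptr_t dcgmHandle_t;
typedef uintptr_t dcgmGpuGrp_t;
typedef uintptr_t dcgmStatus_t;
typedef enum { DCGM_ST_OK = 0, DCGM_ST_BADPARAM = -1 } dcgmReturn_t;
typedef struct { unsigned int condition; } dcgmPolicy_t; /* simplified layout */

/* Stub with the documented signature: writes a fake condition mask into
 * each of the `count` slots instead of querying a real policy manager. */
static dcgmReturn_t dcgmPolicyGet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId,
                                  int count, dcgmPolicy_t *policy,
                                  dcgmStatus_t statusHandle)
{
    (void)pDcgmHandle; (void)groupId; (void)statusHandle;
    if (count <= 0 || policy == NULL)
        return DCGM_ST_BADPARAM;
    for (int i = 0; i < count; ++i)
        policy[i].condition = 0x1u;
    return DCGM_ST_OK;
}

#define GROUP_GPU_COUNT 4 /* hypothetical number of GPUs in the group */

/* Retrieve one policy per GPU and return how many have any condition set,
 * or -1 if the call fails. */
static int count_active_policies(dcgmHandle_t handle, dcgmGpuGrp_t group)
{
    dcgmPolicy_t policies[GROUP_GPU_COUNT]; /* one slot per GPU */

    /* A zero status handle stands in for NULL: no detailed error info. */
    dcgmReturn_t rc = dcgmPolicyGet(handle, group, GROUP_GPU_COUNT,
                                    policies, (dcgmStatus_t)0);
    if (rc != DCGM_ST_OK)
        return -1;

    int active = 0;
    for (int i = 0; i < GROUP_GPU_COUNT; ++i)
        if (policies[i].condition != 0u)
            ++active;
    return active;
}
```

Sizing the array to the group's GPU count and passing the same number as count keeps the call from writing past the caller's storage.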

dcgmReturn_t dcgmPolicyRegister ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition, fpRecvUpdates beginCallback, fpRecvUpdates finishCallback )
Parameters
pDcgmHandle
IN: DCGM Handle
groupId
IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
condition
IN: The set of conditions specified as an OR'd list (see dcgmPolicyCondition_t) for which to register a callback function
beginCallback
IN: A reference to a function that is called when a violation occurs. This function is called before any actions specified by the policy are taken.
finishCallback
IN: A reference to a function that is called when a violation occurs. This function is called after all actions specified by the policy have completed.
Returns

Description

Register a function to be called when a specific policy condition (see dcgmPolicyCondition_t) has been violated. The callback(s) are called automatically in DCGM_OPERATION_MODE_AUTO mode, and only after dcgmPolicyTrigger is called in DCGM_OPERATION_MODE_MANUAL mode. All callbacks are made within a separate thread.
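
A registration with an OR'd condition mask and a begin/finish callback pair might look like the sketch below. Several things here are assumptions to verify against your DCGM headers: the enumerator names and values, the fpRecvUpdates signature (taken here to be int (*)(void *)), and the helper and callback names. The stub dcgmPolicyRegister simulates a single violation synchronously so the example runs without DCGM; the real library delivers callbacks from a separate thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins (assumptions) so the sketch compiles without the DCGM headers. */
typedef uintptr_t dcgmHandle_t;
typedef uintptr_t dcgmGpuGrp_t;
typedef enum { DCGM_ST_OK = 0, DCGM_ST_BADPARAM = -1 } dcgmReturn_t;
typedef enum {
    DCGM_POLICY_COND_DBE     = 0x1, /* assumed value */
    DCGM_POLICY_COND_THERMAL = 0x8  /* assumed value */
} dcgmPolicyCondition_t;
typedef int (*fpRecvUpdates)(void *userData); /* assumed signature */

/* Stub with the documented signature. It simulates one violation by calling
 * the begin callback, then (after the imaginary policy action) the finish one. */
static dcgmReturn_t dcgmPolicyRegister(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId,
                                       dcgmPolicyCondition_t condition,
                                       fpRecvUpdates beginCallback,
                                       fpRecvUpdates finishCallback)
{
    (void)pDcgmHandle; (void)groupId;
    if ((unsigned int)condition == 0u)
        return DCGM_ST_BADPARAM;
    if (beginCallback != NULL)
        beginCallback(NULL);
    if (finishCallback != NULL)
        finishCallback(NULL);
    return DCGM_ST_OK;
}

/* Plain counters are enough for this synchronous stub. With the real library
 * the callbacks run on another thread, so shared state needs atomics or a lock. */
static int g_beginCalls;
static int g_finishCalls;
static int g_finishSawBegin;

static int on_violation_begin(void *userData)
{
    (void)userData;
    ++g_beginCalls;
    return 0;
}

static int on_violation_finish(void *userData)
{
    (void)userData;
    g_finishSawBegin = (g_beginCalls > g_finishCalls); /* begin ran first */
    ++g_finishCalls;
    return 0;
}

/* Register one callback pair for two conditions combined with bitwise OR. */
static dcgmReturn_t watch_dbe_and_thermal(dcgmHandle_t handle, dcgmGpuGrp_t group)
{
    dcgmPolicyCondition_t mask =
        (dcgmPolicyCondition_t)(DCGM_POLICY_COND_DBE | DCGM_POLICY_COND_THERMAL);
    return dcgmPolicyRegister(handle, group, mask,
                              on_violation_begin, on_violation_finish);
}
```

Keeping the callbacks short and free of blocking calls is a reasonable default, since they run outside the thread that registered them.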

dcgmReturn_t dcgmPolicySet ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicy_t* policy, dcgmStatus_t statusHandle )
Parameters
pDcgmHandle
IN: DCGM Handle
groupId
IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
policy
IN: A reference to dcgmPolicy_t that will be applied to all GPUs in the group.
statusHandle
IN/OUT: Resulting status for the operation. Pass NULL if detailed error information for the operation is not needed. Refer to dcgmStatusCreate for details on creating a status handle.
Returns

Description

Set the current violation policy inside the policy manager. If a violation of the conditions in the dcgmPolicy_t structure occurs, subsequent action(s) may be performed to either report or contain the failure.
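
Applying a single policy to every GPU in a group could be sketched as below. The dcgmPolicy_t used here is a hypothetical one-field layout; the real structure carries more fields (for example the action to take on a violation), which this sketch omits. The typedefs, enumerator values, the apply_dbe_thermal_policy helper, and the stub dcgmPolicySet that merely stores the policy are likewise assumptions, included so the example runs without DCGM.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins (assumptions) so the sketch compiles without the DCGM headers. */
typedef uintptr_t dcgmHandle_t;
typedef uintptr_t dcgmGpuGrp_t;
typedef uintptr_t dcgmStatus_t;
typedef enum { DCGM_ST_OK = 0, DCGM_ST_BADPARAM = -1 } dcgmReturn_t;
typedef enum {
    DCGM_POLICY_COND_DBE     = 0x1, /* assumed value */
    DCGM_POLICY_COND_THERMAL = 0x8  /* assumed value */
} dcgmPolicyCondition_t;
typedef struct { dcgmPolicyCondition_t condition; } dcgmPolicy_t; /* simplified */

static dcgmPolicy_t g_storedPolicy; /* the stub's "policy manager" state */

/* Stub with the documented signature: remembers the last policy it was given. */
static dcgmReturn_t dcgmPolicySet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId,
                                  dcgmPolicy_t *policy, dcgmStatus_t statusHandle)
{
    (void)pDcgmHandle; (void)groupId; (void)statusHandle;
    if (policy == NULL)
        return DCGM_ST_BADPARAM;
    g_storedPolicy = *policy;
    return DCGM_ST_OK;
}

/* Build a policy that watches two conditions and apply it to the group. */
static dcgmReturn_t apply_dbe_thermal_policy(dcgmHandle_t handle, dcgmGpuGrp_t group)
{
    dcgmPolicy_t policy;
    memset(&policy, 0, sizeof policy); /* start from a known state */
    policy.condition =
        (dcgmPolicyCondition_t)(DCGM_POLICY_COND_DBE | DCGM_POLICY_COND_THERMAL);

    /* A zero status handle stands in for NULL: no detailed error info. */
    return dcgmPolicySet(handle, group, &policy, (dcgmStatus_t)0);
}
```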

dcgmReturn_t dcgmPolicyUnregister ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition )
Parameters
pDcgmHandle
IN: DCGM Handle
groupId
IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
condition
IN: The set of conditions specified as an OR'd list (see dcgmPolicyCondition_t) for which to unregister a callback function
Returns

Description

Unregister the function(s) registered for a specific policy condition (see dcgmPolicyCondition_t). This function unregisters all callbacks for the given condition and handle.
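
One way to pair registration with cleanup is to keep the condition mask in a single variable and pass it to both calls, as in this sketch. The stubs model the registration state as a bit mask, which is an assumption made for illustration and not a description of DCGM internals. The typedefs, enumerator values, the fpRecvUpdates signature, and the helper names are also assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins (assumptions) so the sketch compiles without the DCGM headers. */
typedef uintptr_t dcgmHandle_t;
typedef uintptr_t dcgmGpuGrp_t;
typedef enum { DCGM_ST_OK = 0, DCGM_ST_BADPARAM = -1 } dcgmReturn_t;
typedef enum {
    DCGM_POLICY_COND_DBE     = 0x1, /* assumed value */
    DCGM_POLICY_COND_THERMAL = 0x8  /* assumed value */
} dcgmPolicyCondition_t;
typedef int (*fpRecvUpdates)(void *userData); /* assumed signature */

static unsigned int g_registered; /* toy model: conditions with callbacks */

/* Stubs with the documented signatures; they only track the condition bits. */
static dcgmReturn_t dcgmPolicyRegister(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId,
                                       dcgmPolicyCondition_t condition,
                                       fpRecvUpdates beginCallback,
                                       fpRecvUpdates finishCallback)
{
    (void)pDcgmHandle; (void)groupId;
    if ((unsigned int)condition == 0u || beginCallback == NULL || finishCallback == NULL)
        return DCGM_ST_BADPARAM;
    g_registered |= (unsigned int)condition;
    return DCGM_ST_OK;
}

static dcgmReturn_t dcgmPolicyUnregister(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId,
                                         dcgmPolicyCondition_t condition)
{
    (void)pDcgmHandle; (void)groupId;
    if ((unsigned int)condition == 0u)
        return DCGM_ST_BADPARAM;
    g_registered &= ~(unsigned int)condition;
    return DCGM_ST_OK;
}

static int noop_callback(void *userData)
{
    (void)userData;
    return 0;
}

/* Register callbacks, run the monitored work, then unregister the same mask. */
static dcgmReturn_t run_with_policy_callbacks(dcgmHandle_t handle, dcgmGpuGrp_t group)
{
    const dcgmPolicyCondition_t mask =
        (dcgmPolicyCondition_t)(DCGM_POLICY_COND_DBE | DCGM_POLICY_COND_THERMAL);

    dcgmReturn_t rc = dcgmPolicyRegister(handle, group, mask,
                                         noop_callback, noop_callback);
    if (rc != DCGM_ST_OK)
        return rc;

    /* ... the monitored workload would run here ... */

    return dcgmPolicyUnregister(handle, group, mask);
}
```

Reusing one mask variable avoids leaving a condition registered because the two calls drifted apart.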