Policies

group DCGMAPI_PO

This chapter describes the methods that handle system policy management and violation settings.

The APIs in the Policies module can be broken down into the following categories:

Setup and Management

group DCGMAPI_PO_Setup

Describes APIs for setting up policies and registering callbacks to receive notification when a specific policy condition has been violated.

Functions

dcgmReturn_t dcgmPolicySet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicy_t *policy, dcgmStatus_t statusHandle)

Set the current violation policy inside the policy manager.

When a condition specified in the dcgmPolicy_t structure is violated, subsequent action(s) may be performed to either report or contain the failure.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.

  • policy – IN: A reference to dcgmPolicy_t that will be applied to all GPUs in the group.

  • statusHandle – IN/OUT: Resulting status for the operation. Pass it as NULL if the detailed error information is not needed. Refer to dcgmStatusCreate for details on creating a status handle.

Returns:

dcgmReturn_t dcgmPolicyGet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, int count, dcgmPolicy_t *policy, dcgmStatus_t statusHandle)

Get the current violation policy inside the policy manager.

Given a groupId, up to count policy structures are retrieved, one for each GPU in the group.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.

  • count – IN: The size of the policy array. This is the maximum number of policies that will be retrieved and should correspond to the number of GPUs in the group.

  • policy – OUT: A reference to dcgmPolicy_t that will be used as storage for the current policies applied to each GPU in the group.

  • statusHandle – IN/OUT: Resulting status for the operation. Pass it as NULL if the detailed error information for the operation is not needed. Refer to dcgmStatusCreate for details on creating a status handle.

Returns:
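
The relationship between count and the policy array can be sketched as follows. This is a minimal, self-contained illustration, not DCGM code: demo_policy_t, demo_policy_get, and fetch_group_policies are hypothetical stand-ins (with placeholder fields) so the example compiles without the DCGM headers. The point is only that the caller allocates one policy slot per GPU in the group and passes that same number as count.

```c
#include <stdlib.h>

/* Hypothetical stand-in for dcgmPolicy_t; the fields are placeholders. */
typedef struct {
    unsigned int version;
    unsigned int condition;
} demo_policy_t;

/* Stub with the same count/policy shape as the documented call.
 * It writes at most `count` entries, one per GPU in the pretend group,
 * and returns how many it wrote (a convention of this stub only). */
static int demo_policy_get(int gpusInGroup, int count, demo_policy_t *policy)
{
    int n = gpusInGroup < count ? gpusInGroup : count;
    for (int i = 0; i < n; i++) {
        policy[i].version = 1;
        policy[i].condition = 0;
    }
    return n;
}

/* Allocate one zeroed policy slot per GPU, then fetch into the array.
 * The caller owns the returned memory and must free() it. */
static demo_policy_t *fetch_group_policies(int gpusInGroup, int *written)
{
    demo_policy_t *policies = calloc((size_t)gpusInGroup, sizeof *policies);
    if (policies == NULL) {
        *written = 0;
        return NULL;
    }
    *written = demo_policy_get(gpusInGroup, gpusInGroup, policies);
    return policies;
}
```

With the real API, the array would hold dcgmPolicy_t entries and be passed to dcgmPolicyGet together with the handle, group ID, and status handle described above.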

dcgmReturn_t dcgmPolicyRegister(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition, fpRecvUpdates beginCallback, fpRecvUpdates finishCallback)

Register a function to be called when a specific policy condition (see dcgmPolicyCondition_t) has been violated.

The callback(s) are called automatically in DCGM_OPERATION_MODE_AUTO mode, and only after dcgmPolicyTrigger is called in DCGM_OPERATION_MODE_MANUAL mode. All callbacks are made from a separate thread.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.

  • condition – IN: The set of conditions, specified as an OR’d list (see dcgmPolicyCondition_t), for which to register a callback function.

  • beginCallback – IN: A reference to a function that is called when a violation occurs. This function is called before any actions specified by the policy are taken.

  • finishCallback – IN: A reference to a function that is called when a violation occurs. This function is called after all actions specified by the policy are completed.

Returns:
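
The registration flow described above can be sketched as follows. This is a minimal, self-contained illustration rather than DCGM code: demo_register, demo_violation, demo_callback_t, and the DEMO_COND_* flag values are hypothetical stand-ins so the example compiles without the DCGM headers. It shows an OR'd condition mask selecting which violations invoke the callbacks, with the begin callback running before the policy action and the finish callback running after it.

```c
#include <stddef.h>

/* Hypothetical stand-ins. Real code would use dcgmPolicyCondition_t,
 * fpRecvUpdates and dcgmPolicyRegister; the flag values are placeholders. */
enum {
    DEMO_COND_DBE     = 0x01,
    DEMO_COND_THERMAL = 0x08,
    DEMO_COND_XID     = 0x40
};

typedef int (*demo_callback_t)(void *userData);

static unsigned int g_mask;       /* OR'd conditions that have callbacks */
static demo_callback_t g_begin;
static demo_callback_t g_finish;
static char g_log[8];             /* records call order: B, A, F */
static int g_len;

static void log_step(char c)
{
    if (g_len < (int)sizeof g_log - 1) {
        g_log[g_len++] = c;
        g_log[g_len] = '\0';
    }
}

static int on_begin(void *userData)  { (void)userData; log_step('B'); return 0; }
static int on_finish(void *userData) { (void)userData; log_step('F'); return 0; }

/* Remember the condition mask and the two callbacks. */
static void demo_register(unsigned int conditions,
                          demo_callback_t begin, demo_callback_t finish)
{
    g_mask = conditions;
    g_begin = begin;
    g_finish = finish;
}

/* Simulate a violation: begin callback, then the action, then finish. */
static void demo_violation(unsigned int condition)
{
    if ((g_mask & condition) == 0)
        return;                   /* nothing registered for this condition */
    if (g_begin != NULL)
        g_begin(NULL);
    log_step('A');                /* the policy action would run here */
    if (g_finish != NULL)
        g_finish(NULL);
}
```

Registering with DEMO_COND_DBE | DEMO_COND_XID and then simulating a thermal violation leaves the log empty, while an XID violation records the order B, A, F. Because real callbacks are made from a separate thread, any state they share with the rest of the program needs synchronization, which this single-threaded sketch omits.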

dcgmReturn_t dcgmPolicyUnregister(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition)

Unregister a function to be called for a specific policy condition (see dcgmPolicyCondition_t).

This function will unregister all callbacks for a given condition and handle.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.

  • condition – IN: The set of conditions, specified as an OR’d list (see dcgmPolicyCondition_t), for which to unregister a callback function.

Returns:
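
Because condition is an OR'd list, one call can cover several conditions at once. The bit arithmetic can be sketched as below. This is an illustration only: demo_unregister and the DEMO_COND_* values are hypothetical placeholders, and the sketch assumes nothing about how DCGM tracks registrations internally.

```c
/* Placeholder bit values; the real flags come from dcgmPolicyCondition_t. */
#define DEMO_COND_DBE     0x01u
#define DEMO_COND_THERMAL 0x08u
#define DEMO_COND_XID     0x40u

/* Return the conditions that still have callbacks registered after
 * every condition named in `toRemove` has been unregistered. */
static unsigned int demo_unregister(unsigned int registered, unsigned int toRemove)
{
    return registered & ~toRemove;
}
```

Starting from all three placeholder conditions and removing DEMO_COND_THERMAL | DEMO_COND_XID leaves only DEMO_COND_DBE. Removing a condition that was never registered leaves the mask unchanged.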

Manual Invocation

group DCGMAPI_PO_MI

Describes APIs which can be used to perform direct actions (e.g. Perform GPU Reset, Run Health Diagnostics) on a group of GPUs.

Functions

dcgmReturn_t dcgmActionValidate(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyValidation_t validate, dcgmDiagResponse_t *response)

Inform the action manager to perform a manual validation of a group of GPUs on the system.

DEPRECATED

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.

  • validate – IN: The validation to perform after the action.

  • response – OUT: Result of the validation process. Refer to dcgmDiagResponse_t for details.

Returns:

dcgmReturn_t dcgmActionValidate_v2(dcgmHandle_t pDcgmHandle, dcgmRunDiag_v7 *drd, dcgmDiagResponse_t *response)

Inform the action manager to perform a manual validation of a group of GPUs on the system.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • drd – IN: Contains the group ID, test names, test parameters, struct version, and the validation that should be performed. See dcgmGroupCreate for details on creating a group. Alternatively, set the group ID to DCGM_GROUP_ALL_GPUS to perform the operation on all GPUs.

  • response – OUT: Result of the validation process. Refer to dcgmDiagResponse_t for details.

Returns:

dcgmReturn_t dcgmRunDiagnostic(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmDiagnosticLevel_t diagLevel, dcgmDiagResponse_t *diagResponse)

Run a diagnostic on a group of GPUs.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.

  • diagLevel – IN: Diagnostic level to run.

  • diagResponse – IN/OUT: Result of running the DCGM diagnostic. .version should be set to dcgmDiagResponse_version before this call.

Returns:
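
The requirement to set .version before the call can be sketched as follows. This is a minimal, self-contained illustration, not DCGM code: demo_diag_response_t, demo_run_diagnostic, the DEMO_* constants, and the stub's rejection of an unset version are all assumptions made so the example compiles without the DCGM headers. In real code the structure is dcgmDiagResponse_t and the value assigned is dcgmDiagResponse_version.

```c
#include <string.h>

/* Placeholder constants; the real values come from the DCGM headers. */
#define DEMO_DIAG_RESPONSE_VERSION 1u
#define DEMO_ST_OK                 0
#define DEMO_ST_VER_MISMATCH       (-1)

/* Hypothetical stand-in for dcgmDiagResponse_t. */
typedef struct {
    unsigned int version;
    unsigned int testsRun;  /* placeholder result field */
} demo_diag_response_t;

/* Stub diagnostic: refuses a response whose version was not set
 * (an assumed behavior, chosen to make the requirement visible). */
static int demo_run_diagnostic(demo_diag_response_t *resp)
{
    if (resp == NULL || resp->version != DEMO_DIAG_RESPONSE_VERSION)
        return DEMO_ST_VER_MISMATCH;
    resp->testsRun = 3;
    return DEMO_ST_OK;
}

/* Zero the response, set its version, then run the diagnostic. */
static int run_with_versioned_response(demo_diag_response_t *resp)
{
    memset(resp, 0, sizeof *resp);
    resp->version = DEMO_DIAG_RESPONSE_VERSION;
    return demo_run_diagnostic(resp);
}
```

Zeroing the structure before assigning the version keeps stale data out of the fields the diagnostic fills in.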