Policies
- group DCGMAPI_PO
This chapter describes the methods that handle system policy management and violation settings.
The APIs in the Policies module can be broken down into the following categories:
Setup and Management: APIs for setting up policies and registering callbacks that receive a notification when a specific policy condition has been violated.
Functions
-
dcgmReturn_t dcgmPolicySet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicy_t *policy, dcgmStatus_t statusHandle)
Set the current violation policy inside the policy manager.
Given the conditions within the dcgmPolicy_t structure, if a violation has occurred, subsequent action(s) may be performed to either report or contain the failure.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
policy – IN: A reference to dcgmPolicy_t that will be applied to all GPUs in the group.
statusHandle – IN/OUT: Resulting status for the operation. Pass it as NULL if the detailed error information is not needed. Refer to dcgmStatusCreate for details on creating a status handle.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if groupId or policy is invalid
DCGM_ST_NOT_SUPPORTED if any unsupported GPUs are part of the GPU group specified in groupId
DCGM_ST_* if a different error has occurred; the error is stored in statusHandle. Refer to dcgmReturn_t.
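The return codes above can be read as a two-phase check: validate the arguments and the group, then apply the one policy to every GPU. The following is a toy model of that flow, not DCGM code. Every `demo_` name, the struct layouts, and the numeric return values are hypothetical stand-ins, and the choice to leave the group untouched on failure is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are DCGM types or values. */
typedef struct { unsigned int condition_mask; } demo_policy_t;
typedef struct { int supported; demo_policy_t policy; } demo_gpu_t;

enum { DEMO_OK = 0, DEMO_BADPARAM = -1, DEMO_NOT_SUPPORTED = -2 };

/* Apply a single policy to every GPU in the group. */
int demo_policy_set(demo_gpu_t *group, size_t gpu_count,
                    const demo_policy_t *policy)
{
    if (group == NULL || gpu_count == 0 || policy == NULL)
        return DEMO_BADPARAM;

    /* Assumption of this sketch: check the whole group first, so a
       failure leaves every GPU's policy unchanged. */
    for (size_t i = 0; i < gpu_count; ++i)
        if (!group[i].supported)
            return DEMO_NOT_SUPPORTED;

    for (size_t i = 0; i < gpu_count; ++i)
        group[i].policy = *policy;
    return DEMO_OK;
}
```

In this model, one unsupported GPU anywhere in the group is enough to reject the whole call, which matches the wording of the DCGM_ST_NOT_SUPPORTED entry.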
-
dcgmReturn_t dcgmPolicyGet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, int count, dcgmPolicy_t *policy, dcgmStatus_t statusHandle)
Get the current violation policy inside the policy manager.
Given a groupId, a number of policy structures are retrieved.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
count – IN: The size of the policy array. This is the maximum number of policies that will be retrieved, and it should ultimately correspond to the number of GPUs in the group.
policy – OUT: A reference to dcgmPolicy_t that will be used as storage for the current policies applied to each GPU in the group.
statusHandle – IN/OUT: Resulting status for the operation. Pass it as NULL if the detailed error information for the operation is not needed. Refer to dcgmStatusCreate for details on creating a status handle.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if groupId or policy is invalid
DCGM_ST_* if a different error has occurred; the error is stored in statusHandle. Refer to dcgmReturn_t.
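Because count is an upper bound on the number of policies written, an array smaller than the group gives only a partial view. The sketch below illustrates why the array should be sized to the number of GPUs. It is a toy model: `demo_policy_t` and `demo_policy_get` are hypothetical, and how the real call behaves when count is smaller than the group is not stated in this reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-GPU policy record; not the DCGM layout. */
typedef struct {
    unsigned int gpu_id;
    unsigned int condition_mask;
} demo_policy_t;

/*
 * Copy at most `count` policies from `stored` (one entry per GPU in the
 * group) into the caller's array. Returns the number of entries written,
 * or -1 if the arguments are unusable.
 */
int demo_policy_get(const demo_policy_t *stored, int gpus_in_group,
                    int count, demo_policy_t *out)
{
    if (stored == NULL || out == NULL || count <= 0 || gpus_in_group < 0)
        return -1;

    int n = gpus_in_group < count ? gpus_in_group : count;
    for (int i = 0; i < n; ++i)
        out[i] = stored[i];
    return n;
}
```

With three GPUs in the group and an array of two, the model returns 2 and the third GPU's policy is never seen by the caller.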
-
dcgmReturn_t dcgmPolicyRegister(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition, fpRecvUpdates beginCallback, fpRecvUpdates finishCallback)
Register a function to be called when a specific policy condition (see dcgmPolicyCondition_t) has been violated.
The callback(s) are called automatically in DCGM_OPERATION_MODE_AUTO mode, and only after dcgmPolicyTrigger in DCGM_OPERATION_MODE_MANUAL mode. All callbacks are made within a separate thread.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
condition – IN: The set of conditions, specified as an OR’d list (see dcgmPolicyCondition_t), for which to register a callback function.
beginCallback – IN: A reference to a function that is called when a violation occurs. This function is called before any actions specified by the policy are taken.
finishCallback – IN: A reference to a function that is called when a violation occurs. This function is called after any actions specified by the policy are completed.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if groupId or condition is invalid, or if beginCallback or finishCallback is NULL
DCGM_ST_NOT_SUPPORTED if any unsupported GPUs are part of the GPU group specified in groupId
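The ordering described for beginCallback and finishCallback, together with the OR'd condition mask, can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the DCGM implementation: the `DEMO_COND_*` bits, the callback signature, and all `demo_` functions are hypothetical, and the model runs on the calling thread rather than a separate one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical condition bits; the real ones are in dcgmPolicyCondition_t. */
enum { DEMO_COND_A = 1u << 0, DEMO_COND_B = 1u << 1, DEMO_COND_C = 1u << 2 };

typedef int (*demo_callback_t)(void *data);

typedef struct {
    unsigned int registered;   /* OR'd conditions being watched */
    demo_callback_t begin_cb;
    demo_callback_t finish_cb;
} demo_registration_t;

/* Records the order in which callbacks and the action ran. */
typedef struct { char order[4]; int len; } demo_trace_t;

static void demo_push(demo_trace_t *t, char c)
{
    if (t->len < 3) { t->order[t->len++] = c; t->order[t->len] = '\0'; }
}
int  demo_begin(void *data)  { demo_push((demo_trace_t *)data, 'B'); return 0; }
void demo_action(void *data) { demo_push((demo_trace_t *)data, 'A'); }
int  demo_finish(void *data) { demo_push((demo_trace_t *)data, 'F'); return 0; }

/* Reject an empty mask or a missing callback. */
int demo_register(demo_registration_t *reg, unsigned int conditions,
                  demo_callback_t begin_cb, demo_callback_t finish_cb)
{
    if (reg == NULL || conditions == 0 || begin_cb == NULL || finish_cb == NULL)
        return -1;
    reg->registered = conditions;
    reg->begin_cb = begin_cb;
    reg->finish_cb = finish_cb;
    return 0;
}

/* On a violation: begin callback, then the policy action, then finish. */
int demo_on_violation(const demo_registration_t *reg, unsigned int violated,
                      void (*action)(void *), void *data)
{
    assert(reg != NULL);
    if ((reg->registered & violated) == 0)
        return 0;              /* no callback registered for this condition */
    reg->begin_cb(data);
    if (action != NULL)
        action(data);
    reg->finish_cb(data);
    return 1;
}
```

Registering `DEMO_COND_A | DEMO_COND_C` and then reporting a `DEMO_COND_C` violation leaves the trace as begin, action, finish; a `DEMO_COND_B` violation is ignored because that bit was never registered.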
-
dcgmReturn_t dcgmPolicyUnregister(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition)
Unregister a function to be called for a specific policy condition (see dcgmPolicyCondition_t).
This function will unregister all callbacks for a given condition and handle.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
condition – IN: The set of conditions, specified as an OR’d list (see dcgmPolicyCondition_t), for which to unregister a callback function.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if groupId or condition is invalid, or if callback is NULL
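Since condition is an OR'd list, unregistering can be thought of as clearing the named bits from the set of watched conditions. The snippet below is a minimal sketch of that idea only. The `DEMO_COND_*` bits and both functions are hypothetical, and it does not claim to reflect how DCGM tracks registrations internally.

```c
#include <assert.h>

/* Hypothetical condition bits; the real ones are in dcgmPolicyCondition_t. */
enum { DEMO_COND_A = 1u << 0, DEMO_COND_B = 1u << 1, DEMO_COND_C = 1u << 2 };

/* Drop every condition named in `conditions` from the registered mask. */
unsigned int demo_unregister(unsigned int registered, unsigned int conditions)
{
    return registered & ~conditions;
}

/* Non-zero if `condition` is still being watched. */
int demo_is_registered(unsigned int registered, unsigned int condition)
{
    return (registered & condition) != 0;
}
```

Clearing a bit that was never set is a no-op in this model, so unregistering the same condition twice is harmless here.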
Manual Invocation: APIs that can be used to perform direct actions (e.g. perform a GPU reset, run health diagnostics) on a group of GPUs.
Functions
-
dcgmReturn_t dcgmActionValidate(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyValidation_t validate, dcgmDiagResponse_v10 *response)
Inform the action manager to perform a manual validation of a group of GPUs on the system.
DEPRECATED
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
validate – IN: The validation to perform after the action.
response – OUT: Result of the validation process. Refer to dcgmDiagResponse_v10 for details. Note: It is the caller’s responsibility to make sure the response is zero-initialized, except for the version field.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_NOT_SUPPORTED if running the specified validation is not supported. This is usually because the Tesla recommended driver is not installed on the system.
DCGM_ST_BADPARAM if groupId, validate, or statusHandle is invalid
DCGM_ST_GENERIC_ERROR an internal error has occurred
DCGM_ST_GROUP_INCOMPATIBLE if groupId refers to a group of non-homogeneous GPUs. This is currently not allowed.
-
dcgmReturn_t dcgmActionValidate_v2(dcgmHandle_t pDcgmHandle, dcgmRunDiag_v8 *drd, dcgmDiagResponse_v10 *response)
Inform the action manager to perform a manual validation of a group of GPUs on the system.
- Parameters:
pDcgmHandle – IN: DCGM Handle
drd – IN: Contains the group ID, test names, test parameters, struct version, and the validation that should be performed. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
response – OUT: Result of the validation process. Refer to dcgmDiagResponse_v10 for details. Note: It is the caller’s responsibility to make sure the response is zero-initialized, except for the version field.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_NOT_SUPPORTED if running the specified validation is not supported. This is usually because the Tesla recommended driver is not installed on the system.
DCGM_ST_BADPARAM if groupId, validate, or statusHandle is invalid
DCGM_ST_GENERIC_ERROR an internal error has occurred
DCGM_ST_GROUP_INCOMPATIBLE if groupId refers to a group of non-homogeneous GPUs. This is currently not allowed.
-
dcgmReturn_t dcgmRunDiagnostic(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmDiagnosticLevel_t diagLevel, dcgmDiagResponse_v10 *diagResponse)
Run a diagnostic on a group of GPUs.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
diagLevel – IN: Diagnostic level to run
diagResponse – IN/OUT: Result of running the DCGM diagnostic. .version should be set to dcgmDiagResponse_version10 before this call. Note: It is the caller’s responsibility to make sure the response is zero-initialized, except for the version field.
- Returns:
DCGM_ST_OK if the call was successful
DCGM_ST_NOT_SUPPORTED if running the diagnostic is not supported. This is usually because the Tesla recommended driver is not installed on the system.
DCGM_ST_BADPARAM if a provided parameter is invalid or missing
DCGM_ST_GENERIC_ERROR an internal error has occurred
DCGM_ST_GROUP_INCOMPATIBLE if groupId refers to a group of non-homogeneous GPUs. This is currently not allowed.
DCGM_ST_VER_MISMATCH if .version is not set or is invalid.
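The note about zero-initializing the response while setting only its version field follows a common pattern for versioned structs. The sketch below shows that pattern with a stand-in type. `demo_response_t`, `DEMO_RESPONSE_VERSION`, and both functions are hypothetical: the layout is not that of dcgmDiagResponse_v10, the constant is not the value of dcgmDiagResponse_version10, and the return values are placeholders for DCGM_ST_BADPARAM and DCGM_ST_VER_MISMATCH.

```c
#include <assert.h>
#include <string.h>

/* Placeholder version tag; NOT the value of dcgmDiagResponse_version10. */
#define DEMO_RESPONSE_VERSION 10u

/* Hypothetical versioned response struct; not the DCGM layout. */
typedef struct {
    unsigned int version;
    unsigned int num_tests;
    char info[64];
} demo_response_t;

/* Zero every field, then set only the version field. */
void demo_response_init(demo_response_t *r)
{
    memset(r, 0, sizeof(*r));
    r->version = DEMO_RESPONSE_VERSION;
}

/* Toy callee that checks the pointer first, then the version tag. */
int demo_run_diagnostic(demo_response_t *r)
{
    if (r == NULL)
        return -1;   /* placeholder for DCGM_ST_BADPARAM */
    if (r->version != DEMO_RESPONSE_VERSION)
        return -2;   /* placeholder for DCGM_ST_VER_MISMATCH */
    r->num_tests = 1;
    strcpy(r->info, "ok");
    return 0;
}
```

A struct that was zeroed but never given a version is rejected by the toy callee, which is the situation the DCGM_ST_VER_MISMATCH entry describes.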