Policies

group Policies

This chapter describes the methods that handle system policy management and violation settings.

The APIs in the Policies module can be broken down into the following categories:

Setup and Management

group Setup and Management

Describes APIs for setting up policies and registering callbacks to receive a notification when a specific policy condition has been violated.

Functions

dcgmReturn_t dcgmPolicySet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicy_t *policy, dcgmStatus_t statusHandle)

Set the current violation policy inside the policy manager.

Given the conditions within the dcgmPolicy_t structure, if a violation has occurred, subsequent action(s) may be performed to either report or contain the failure.

Parameters:

pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
policy – IN: A reference to dcgmPolicy_t that will be applied to all GPUs in the group.
statusHandle – IN/OUT: Resulting status for the operation. Pass it as NULL if the detailed error information is not needed. Refer to dcgmStatusCreate for details on creating a status handle.

Returns:

DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if groupId or policy is invalid
DCGM_ST_NOT_SUPPORTED if any unsupported GPUs are part of the GPU group specified in groupId
DCGM_ST_* a different error has occurred and is stored in statusHandle. Refer to dcgmReturn_t

dcgmReturn_t dcgmPolicyGet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, int count, dcgmPolicy_t *policy, dcgmStatus_t statusHandle)

Get the current violation policy inside the policy manager.

Given a groupId, a number of policy structures are retrieved.

Parameters:

pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
count – IN: The size of the policy array. This is the maximum number of policies that will be retrieved; it should correspond to the number of GPUs in the group.
policy – OUT: A reference to dcgmPolicy_t that will be used as storage for the current policies applied to each GPU in the group.
statusHandle – IN/OUT: Resulting status for the operation. Pass it as NULL if the detailed error information for the operation is not needed. Refer to dcgmStatusCreate for details on creating a status handle.

Returns:

DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if groupId or policy is invalid
DCGM_ST_* a different error has occurred and is stored in statusHandle. Refer to dcgmReturn_t
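
The relationship between count and the policy array can be sketched as follows. This is a minimal illustration, not DCGM code: example_policy_t, example_alloc_policies, and example_policy_get are hypothetical stand-ins that only model the documented rule that at most count entries are written, so the array should have one slot per GPU in the group.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a per-GPU policy record (not dcgmPolicy_t). */
typedef struct {
    unsigned int version;
    unsigned int condition;
} example_policy_t;

/* Allocate one zeroed slot per GPU so that no policy is left out. */
static example_policy_t *example_alloc_policies(int gpusInGroup)
{
    if (gpusInGroup <= 0)
        return NULL;
    return calloc((size_t)gpusInGroup, sizeof(example_policy_t));
}

/* Stand-in getter: writes at most `count` entries into `out`.
 * Returns the number of entries written, or -1 on a bad parameter.
 * (The real call reports status through dcgmReturn_t instead.) */
static int example_policy_get(int gpusInGroup, int count, example_policy_t *out)
{
    if (out == NULL || count <= 0 || gpusInGroup < 0)
        return -1;
    int n = gpusInGroup < count ? gpusInGroup : count;
    for (int i = 0; i < n; i++) {
        out[i].version   = 1;
        out[i].condition = 0;
    }
    return n;
}
```

In this model, passing a count smaller than the number of GPUs leaves the remaining policies unretrieved, which is why the array is sized from the group rather than from a fixed constant.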

dcgmReturn_t dcgmPolicyRegister_v2(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition, fpRecvUpdates callback, uint64_t userData)

Register a function to be called when a specific policy condition (see dcgmPolicyCondition_t) has been violated.

The callback(s) will be called automatically when in DCGM_OPERATION_MODE_AUTO mode, and only after dcgmPolicyTrigger is called when in DCGM_OPERATION_MODE_MANUAL mode. All callbacks are made within a separate thread.

Parameters:

pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
condition – IN: The set of conditions specified as an OR’d list (see dcgmPolicyCondition_t) for which to register a callback function.
callback – IN: A reference to a function that should be called should a violation occur. This function will be called before any actions specified by the policy are taken.
userData – IN: User data pointer to pass to the userData field of callback

Returns:

DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if groupId or condition is invalid, or callback is NULL
DCGM_ST_NOT_SUPPORTED if any unsupported GPUs are part of the GPU group specified in groupId
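
The way an OR’d condition set and a 64-bit userData value fit together can be sketched as below. This is an illustrative model under stated assumptions, not the DCGM implementation: the EXAMPLE_COND_* bits, example_callback_t, and example_dispatch are hypothetical, and the real condition values and the real fpRecvUpdates signature come from the DCGM headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical condition bits; real values are the dcgmPolicyCondition_t
 * enumerators defined by DCGM. */
typedef enum {
    EXAMPLE_COND_A = 0x1,
    EXAMPLE_COND_B = 0x2,
    EXAMPLE_COND_C = 0x4
} example_condition_t;

/* Hypothetical callback shape (not the real fpRecvUpdates type). */
typedef int (*example_callback_t)(unsigned int violated, uint64_t userData);

typedef struct {
    unsigned int       mask;     /* OR'd set of registered conditions */
    example_callback_t callback;
    uint64_t           userData; /* opaque value handed back to the callback */
} example_registration_t;

/* Application state reached through userData. */
typedef struct {
    unsigned int lastCondition;
    int          hits;
} example_state_t;

static int example_on_violation(unsigned int violated, uint64_t userData)
{
    /* Recover the pointer that was packed into the 64-bit value. */
    example_state_t *state = (example_state_t *)(uintptr_t)userData;
    state->lastCondition = violated;
    state->hits++;
    return 0;
}

/* Invoke the callback only when the violated condition is in the mask.
 * Returns 1 if the callback was dispatched, 0 otherwise. */
static int example_dispatch(const example_registration_t *reg, unsigned int violated)
{
    if (reg->callback == NULL || (reg->mask & violated) == 0)
        return 0;
    reg->callback(violated, reg->userData);
    return 1;
}
```

A registration built with EXAMPLE_COND_A | EXAMPLE_COND_C and userData set to (uint64_t)(uintptr_t)&state would be dispatched for A or C but not for B. Because the text above states that callbacks are made within a separate thread, any state reached through userData would need its own synchronization in a real program.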

dcgmReturn_t dcgmPolicyUnregister(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyCondition_t condition)

Unregister a function to be called for a specific policy condition (see dcgmPolicyCondition_t).

This function will unregister all callbacks for a given condition and handle.

Parameters:

pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
condition – IN: The set of conditions specified as an OR’d list (see dcgmPolicyCondition_t) for which to unregister a callback function

Returns:

DCGM_ST_OK if the call was successful
DCGM_ST_BADPARAM if groupId or condition is invalid
DCGM_ST_IN_USE if a callback from policy registration is in progress

Manual Invocation

group Manual Invocation

Describes APIs which can be used to perform direct actions (e.g. Perform GPU Reset, Run Health Diagnostics) on a group of GPUs.

Functions

dcgmReturn_t dcgmActionValidate(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyValidation_t validate, dcgmDiagResponse_v12 *response)

Inform the action manager to perform a manual validation of a group of GPUs on the system.

DEPRECATED

Parameters:

pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
validate – IN: The validation to perform after the action.
response – OUT: Result of the validation process. Refer to dcgmDiagResponse_t for details.

Returns:

DCGM_ST_OK if the call was successful
DCGM_ST_NOT_SUPPORTED if running the specified validation is not supported. This is usually due to the Tesla recommended driver not being installed on the system.
DCGM_ST_BADPARAM if groupId, validate, or response is invalid
DCGM_ST_GENERIC_ERROR an internal error has occurred
DCGM_ST_GROUP_INCOMPATIBLE if groupId refers to a group of non-homogeneous GPUs. This is currently not allowed.

dcgmReturn_t dcgmActionValidate_v2(dcgmHandle_t pDcgmHandle, dcgmRunDiag_v10 *drd, dcgmDiagResponse_v12 *response)

Inform the action manager to perform a manual validation of a group of GPUs on the system.

Parameters:

pDcgmHandle – IN: DCGM Handle
drd – IN: Contains the group id, test names, test parameters, struct version, and the validation that should be performed. See dcgmGroupCreate for details on creating the group. Alternatively, set the group id to DCGM_GROUP_ALL_GPUS to perform the operation on all GPUs.
response – OUT: Result of the validation process. Refer to dcgmDiagResponse_t for details. Note: It is the caller’s responsibility to make sure the response is zero-initialized, except for the version field.

Returns:

DCGM_ST_OK if the call was successful
DCGM_ST_NOT_SUPPORTED if running the specified validation is not supported. This is usually due to the Tesla recommended driver not being installed on the system.
DCGM_ST_BADPARAM if drd (including its group id or validation) or response is invalid
DCGM_ST_GENERIC_ERROR an internal error has occurred
DCGM_ST_GROUP_INCOMPATIBLE if groupId refers to a group of non-homogeneous GPUs. This is currently not allowed.
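
The note about zero-initializing the response while keeping its version field set can be sketched as follows. This is a minimal illustration with hypothetical names: example_response_t and EXAMPLE_RESPONSE_VERSION stand in for the real response structure and its version constant, which are defined by the DCGM headers.

```c
#include <string.h>

/* Hypothetical stand-in for a versioned response structure. */
typedef struct {
    unsigned int version;     /* must be set by the caller before the call */
    unsigned int numTests;
    char         summary[64];
} example_response_t;

/* Hypothetical version value, used only for this sketch. */
#define EXAMPLE_RESPONSE_VERSION 12u

/* Clear every field first, then set the version field last so that it
 * is the only non-zero member when the structure is handed to the API. */
static void example_response_init(example_response_t *r)
{
    memset(r, 0, sizeof(*r));
    r->version = EXAMPLE_RESPONSE_VERSION;
}

/* Stand-in for the kind of check that leads to a version-mismatch error. */
static int example_version_ok(const example_response_t *r)
{
    return r != NULL && r->version == EXAMPLE_RESPONSE_VERSION;
}
```

Setting the version after the memset matters: doing it in the other order would wipe the version and leave the structure in the state the version check rejects.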

dcgmReturn_t dcgmRunDiagnostic(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmDiagnosticLevel_t diagLevel, dcgmDiagResponse_v12 *diagResponse)

Run a diagnostic on a group of GPUs.

Parameters:

pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
diagLevel – IN: Diagnostic level to run
diagResponse – IN/OUT: Result of running the DCGM diagnostic. .version should be set to dcgmDiagResponse_version before this call.

Returns:

DCGM_ST_OK if the call was successful
DCGM_ST_NOT_SUPPORTED if running the diagnostic is not supported. This is usually due to the Tesla recommended driver not being installed on the system.
DCGM_ST_BADPARAM if a provided parameter is invalid or missing
DCGM_ST_GENERIC_ERROR an internal error has occurred
DCGM_ST_GROUP_INCOMPATIBLE if groupId refers to a group of non-homogeneous GPUs. This is currently not allowed.
DCGM_ST_VER_MISMATCH if .version is not set or is invalid.