1.8.2. Manual Invocation
[Policies]
Describes APIs that perform direct actions (for example, performing a GPU reset or running health diagnostics) on a group of GPUs.
Functions
- dcgmReturn_t dcgmActionValidate ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyValidation_t validate, dcgmDiagResponse_t* response )
- dcgmReturn_t dcgmActionValidate_v2 ( dcgmHandle_t pDcgmHandle, dcgmRunDiag_v7* drd, dcgmDiagResponse_t* response )
- dcgmReturn_t dcgmRunDiagnostic ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmDiagnosticLevel_t diagLevel, dcgmDiagResponse_t* diagResponse )
Functions
- dcgmReturn_t dcgmActionValidate ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmPolicyValidation_t validate, dcgmDiagResponse_t* response )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- groupId
- IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
- validate
- IN: The validation to perform after the action.
- response
- OUT: Result of the validation process. Refer to dcgmDiagResponse_t for details.
Returns
- DCGM_ST_OK if the call was successful
- DCGM_ST_NOT_SUPPORTED if running the specified validation is not supported. This is usually because the Tesla-recommended driver is not installed on the system.
- DCGM_ST_BADPARAM if groupId, validate, or response is invalid
- DCGM_ST_GENERIC_ERROR if an internal error has occurred
- DCGM_ST_GROUP_INCOMPATIBLE if groupId refers to a group of non-homogeneous GPUs. This is currently not allowed.
Description
Inform the action manager to perform a manual validation of a group of GPUs on the system.
This function is deprecated.
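The return codes listed above can be turned into log messages with a small lookup. The following is a minimal sketch, not part of DCGM: the enum is a local stand-in with placeholder values so that the code compiles without the library (the real dcgmReturn_t comes from the DCGM headers), and describe_validate_status is a hypothetical helper name.

```c
/* Local stand-in for dcgmReturn_t. The numeric values are placeholders,
 * not DCGM's. With the real library, include the DCGM headers and delete
 * this enum. */
typedef enum {
    DCGM_ST_OK,
    DCGM_ST_NOT_SUPPORTED,
    DCGM_ST_BADPARAM,
    DCGM_ST_GENERIC_ERROR,
    DCGM_ST_GROUP_INCOMPATIBLE
} dcgmReturn_t;

/* Hypothetical helper: map each documented return code to a short message. */
static const char *describe_validate_status(dcgmReturn_t st)
{
    switch (st) {
    case DCGM_ST_OK:
        return "success";
    case DCGM_ST_NOT_SUPPORTED:
        return "validation not supported (check the installed driver)";
    case DCGM_ST_BADPARAM:
        return "invalid parameter";
    case DCGM_ST_GENERIC_ERROR:
        return "internal error";
    case DCGM_ST_GROUP_INCOMPATIBLE:
        return "group contains non-homogeneous GPUs";
    }
    /* Any code not listed in the Returns section lands here. */
    return "unrecognized status";
}
```

A caller would typically pass the function's return value straight to this helper when logging a failed validation.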
- dcgmReturn_t dcgmActionValidate_v2 ( dcgmHandle_t pDcgmHandle, dcgmRunDiag_v7* drd, dcgmDiagResponse_t* response )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- drd
- IN: Contains the group ID, test names, test parameters, struct version, and the validation that should be performed. See dcgmGroupCreate for details on creating a group. Alternatively, set the group ID to DCGM_GROUP_ALL_GPUS to perform the operation on all GPUs.
- response
- OUT: Result of the validation process. Refer to dcgmDiagResponse_t for details.
Returns
- DCGM_ST_OK if the call was successful
- DCGM_ST_NOT_SUPPORTED if running the specified validation is not supported. This is usually because the Tesla-recommended driver is not installed on the system.
- DCGM_ST_BADPARAM if the group ID or validation specified in drd, or response, is invalid
- DCGM_ST_GENERIC_ERROR if an internal error has occurred
- DCGM_ST_GROUP_INCOMPATIBLE if groupId refers to a group of non-homogeneous GPUs. This is currently not allowed.
Description
Inform the action manager to perform a manual validation of a group of GPUs on the system
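The drd argument bundles the group ID, validation, struct version, and test names into one request. The sketch below illustrates filling such a request before the call. It is an assumption-laden stand-in, not DCGM's definition: runDiagRequest_t, its field names, the array sizes, RUN_DIAG_VERSION, and the helper functions are all hypothetical, and the constant values are placeholders. The real dcgmRunDiag_v7 layout and version macro come from the DCGM headers. Test parameters are omitted for brevity.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local stand-ins so the sketch compiles without DCGM. All values, sizes,
 * and field names are placeholders, not DCGM's definitions. */
typedef uintptr_t dcgmGpuGrp_t;
#define DCGM_GROUP_ALL_GPUS ((dcgmGpuGrp_t)0x7fffffff)
typedef enum { DCGM_POLICY_VALID_NONE, DCGM_POLICY_VALID_SV_SHORT } dcgmPolicyValidation_t;
#define RUN_DIAG_VERSION 7u
#define MAX_TESTS 4
#define MAX_TEST_NAME_LEN 32

/* Hypothetical, trimmed-down stand-in for dcgmRunDiag_v7. */
typedef struct {
    unsigned int version;
    dcgmGpuGrp_t groupId;
    dcgmPolicyValidation_t validate;
    char testNames[MAX_TESTS][MAX_TEST_NAME_LEN];
} runDiagRequest_t;

/* Zero the request, then set the version, group, and validation. */
static void init_request(runDiagRequest_t *drd, dcgmGpuGrp_t group,
                         dcgmPolicyValidation_t validate)
{
    memset(drd, 0, sizeof(*drd));
    drd->version = RUN_DIAG_VERSION;
    drd->groupId = group;
    drd->validate = validate;
}

/* Append a test name to the first free slot.
 * Returns 0 on success, -1 if the name is empty or too long,
 * or if every slot is already in use. */
static int add_test_name(runDiagRequest_t *drd, const char *name)
{
    if (name[0] == '\0' || strlen(name) >= MAX_TEST_NAME_LEN)
        return -1;
    for (int i = 0; i < MAX_TESTS; i++) {
        if (drd->testNames[i][0] == '\0') {
            snprintf(drd->testNames[i], MAX_TEST_NAME_LEN, "%s", name);
            return 0;
        }
    }
    return -1;
}
```

Zeroing the struct first means unused test-name slots are empty strings, so the receiver can tell where the list ends.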
- dcgmReturn_t dcgmRunDiagnostic ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmDiagnosticLevel_t diagLevel, dcgmDiagResponse_t* diagResponse )
Parameters
- pDcgmHandle
- IN: DCGM Handle
- groupId
- IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating a group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all GPUs.
- diagLevel
- IN: Diagnostic level to run
- diagResponse
- IN/OUT: Result of running the DCGM diagnostic. The .version field must be set to dcgmDiagResponse_version before this call.
Returns
- DCGM_ST_OK if the call was successful
- DCGM_ST_NOT_SUPPORTED if running the diagnostic is not supported. This is usually because the Tesla-recommended driver is not installed on the system.
- DCGM_ST_BADPARAM if a provided parameter is invalid or missing
- DCGM_ST_GENERIC_ERROR if an internal error has occurred
- DCGM_ST_GROUP_INCOMPATIBLE if groupId refers to a group of non-homogeneous GPUs. This is currently not allowed.
- DCGM_ST_VER_MISMATCH if .version is not set or is invalid.
Description
Run a diagnostic on a group of GPUs
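The version requirement on diagResponse is easy to miss, so the calling pattern is sketched here. This is a sketch under stated assumptions: the types and constants are local stand-ins with placeholder values so that the code compiles without DCGM, stub_run_diagnostic is a hypothetical stub that only mimics the documented version check, and run_short_diagnostic is a hypothetical wrapper name. With the real library, include the DCGM headers, remove the stand-ins, and call dcgmRunDiagnostic in place of the stub.

```c
#include <stdint.h>
#include <string.h>

/* Local stand-ins; values are placeholders, not DCGM's definitions. */
typedef enum { DCGM_ST_OK, DCGM_ST_BADPARAM, DCGM_ST_VER_MISMATCH } dcgmReturn_t;
typedef enum { DCGM_DIAG_LVL_SHORT = 10 } dcgmDiagnosticLevel_t;
typedef uintptr_t dcgmHandle_t;
typedef uintptr_t dcgmGpuGrp_t;
typedef struct {
    unsigned int version; /* remaining result fields omitted */
} dcgmDiagResponse_t;
#define dcgmDiagResponse_version 1u

/* Hypothetical stub that mimics only the documented parameter checks. */
static dcgmReturn_t stub_run_diagnostic(dcgmHandle_t handle, dcgmGpuGrp_t group,
                                        dcgmDiagnosticLevel_t level,
                                        dcgmDiagResponse_t *response)
{
    (void)handle;
    (void)group;
    (void)level;
    if (response == NULL)
        return DCGM_ST_BADPARAM;
    if (response->version != dcgmDiagResponse_version)
        return DCGM_ST_VER_MISMATCH;
    return DCGM_ST_OK;
}

/* Hypothetical wrapper: zero the response, set .version, then run the
 * diagnostic so that the version check cannot be forgotten by callers. */
static dcgmReturn_t run_short_diagnostic(dcgmHandle_t handle, dcgmGpuGrp_t group,
                                         dcgmDiagResponse_t *response)
{
    if (response == NULL)
        return DCGM_ST_BADPARAM;
    memset(response, 0, sizeof(*response));
    response->version = dcgmDiagResponse_version;
    return stub_run_diagnostic(handle, group, DCGM_DIAG_LVL_SHORT, response);
}
```

Calling the stub with a zeroed response whose version was never set returns DCGM_ST_VER_MISMATCH, which mirrors the failure described in the Returns list.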