Configuration
- group DCGMAPI_DC
This chapter describes the methods that handle device configuration retrieval and default settings.
The APIs in the Configuration module fall into the following categories:
Setup and Management
- group DCGMAPI_DC_Setup
Describes APIs to Get/Set configuration on the group of GPUs.
Functions
-
dcgmReturn_t dcgmConfigSet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmConfig_t *pDeviceConfig, dcgmStatus_t statusHandle)
Sets the configuration for the group of one or more GPUs identified by groupId.
The configuration settings specified in pDeviceConfig are applied to all the GPUs in the group. Because a DCGM group is only a logical grouping of GPUs, the settings stay in effect on the individual GPUs even after the group is destroyed.
To leave a property unchanged, set it in pDeviceConfig to DCGM_INT32_BLANK, DCGM_INT64_BLANK, DCGM_FP64_BLANK or DCGM_STR_BLANK, whichever matches the data type of the property.
If any property fails to be configured for any GPU in the group, the API returns an error. Evaluate the status handle statusHandle to access the error attributes of the failed operations; see the status management APIs at Status handling.
To find the supported clock values that can be passed to dcgmConfigSet, query the device attributes of a GPU in the group with dcgmGetDeviceAttributes.
- Parameters
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group.
pDeviceConfig – IN: Pointer to memory holding the desired configuration to be applied to all the GPUs in the group represented by groupId. The caller must populate the version field of pDeviceConfig.
statusHandle – IN/OUT: Resulting error status for multiple operations. Pass it as NULL if the detailed error information is not needed. Look at dcgmStatusCreate for details on creating status handle.
- Returns
DCGM_ST_OK if the configuration has been successfully set.
DCGM_ST_BADPARAM if groupId or pDeviceConfig is invalid.
DCGM_ST_VER_MISMATCH if pDeviceConfig has the incorrect version.
DCGM_ST_GENERIC_ERROR if an unknown error has occurred.
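The blank-value convention described for dcgmConfigSet can be sketched as follows. This is only an illustration that compiles without DCGM installed: sketch_config_t, SKETCH_CONFIG_VERSION and both helper functions are hypothetical stand-ins, not the real dcgmConfig_t layout, and the fallback definition of DCGM_INT32_BLANK is an assumption used only when the DCGM headers are not included. The idea is to start with every property blank (leave unchanged), populate the version field, and then overwrite only the property to be changed.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in so the sketch compiles on its own; real code gets this
 * sentinel from the DCGM headers (assumed value). */
#ifndef DCGM_INT32_BLANK
#define DCGM_INT32_BLANK 0x7ffffff0
#endif

/* Hypothetical, simplified stand-in for dcgmConfig_t (not the real layout). */
typedef struct
{
    unsigned int version;
    unsigned int eccMode;
    unsigned int computeMode;
    unsigned int powerLimitWatts;
} sketch_config_t;

#define SKETCH_CONFIG_VERSION 1u /* hypothetical version tag */

/* Mark every property as "leave unchanged" and populate the version field. */
void sketch_config_init_blank(sketch_config_t *cfg)
{
    assert(cfg != NULL);
    cfg->version         = SKETCH_CONFIG_VERSION;
    cfg->eccMode         = DCGM_INT32_BLANK;
    cfg->computeMode     = DCGM_INT32_BLANK;
    cfg->powerLimitWatts = DCGM_INT32_BLANK;
}

/* Request a change to a single property; all other properties stay blank. */
void sketch_config_request_power_limit(sketch_config_t *cfg, unsigned int watts)
{
    assert(cfg != NULL);
    cfg->powerLimitWatts = watts;
}
```

In real code the populated structure would then be passed to dcgmConfigSet together with the group ID and, optionally, a status handle.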
-
dcgmReturn_t dcgmConfigGet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmConfigType_t type, int count, dcgmConfig_t deviceConfigList[], dcgmStatus_t statusHandle)
Gets the configuration for all the GPUs present in the group.
This API can get the most recent target (desired) configuration set by dcgmConfigSet. Set type to DCGM_CONFIG_TARGET_STATE to get the target configuration. The target configuration properties are maintained by DCGM and are automatically enforced after a GPU reset or reinitialization is completed.
The method can also be used to get the actual configuration state of the GPUs in the group. Set type to DCGM_CONFIG_CURRENT_STATE to get the actual configuration state. Ideally, the actual configuration state is exactly the same as the target configuration state.
If any property in the target configuration is unknown, its value in the output is populated as DCGM_INT32_BLANK, DCGM_INT64_BLANK, DCGM_FP64_BLANK or DCGM_STR_BLANK, whichever matches the data type of the property.
If any property in the current configuration state is not supported, its value in the output is populated as DCGM_INT32_NOT_SUPPORTED, DCGM_INT64_NOT_SUPPORTED, DCGM_FP64_NOT_SUPPORTED or DCGM_STR_NOT_SUPPORTED, whichever matches the data type of the property.
If any property cannot be fetched for any GPU in the group, the API returns an error. Evaluate the status handle statusHandle to access the error attributes of the failed operations; see the status management APIs at Status handling.
- Parameters
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group.
type – IN: Type of configuration values to be fetched.
count – IN: The number of entries that deviceConfigList array can store.
deviceConfigList – OUT: Pointer to memory that receives the requested configuration for all the GPUs in the group (groupId). The array must be large enough to hold one entry for each GPU present in the group (groupId).
statusHandle – IN/OUT: Resulting error status for multiple operations. Pass it as NULL if the detailed error information is not needed. Look at dcgmStatusCreate for details on creating status handle.
- Returns
DCGM_ST_OK if the configuration has been successfully fetched.
DCGM_ST_BADPARAM if any of groupId, type, count, or deviceConfigList is invalid.
DCGM_ST_NOT_CONFIGURED if the target configuration is not already set.
DCGM_ST_VER_MISMATCH if deviceConfigList has the incorrect version.
DCGM_ST_GENERIC_ERROR if an unknown error has occurred.
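A caller reading the output of dcgmConfigGet has to tell real values apart from the blank and not-supported sentinels described above. The sketch below shows one way to do that for 32-bit properties. It compiles without DCGM installed: the sketch_ names are hypothetical, and the fallback definitions of DCGM_INT32_BLANK and DCGM_INT32_NOT_SUPPORTED are assumptions that apply only when the DCGM headers are not included.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles on its own; real code gets these
 * sentinels from the DCGM headers (assumed values). */
#ifndef DCGM_INT32_BLANK
#define DCGM_INT32_BLANK 0x7ffffff0
#endif
#ifndef DCGM_INT32_NOT_SUPPORTED
#define DCGM_INT32_NOT_SUPPORTED (DCGM_INT32_BLANK + 2)
#endif

/* Hypothetical classification of one 32-bit property read back from a GPU. */
typedef enum
{
    SKETCH_PROP_VALUE,        /* a real reading */
    SKETCH_PROP_BLANK,        /* unknown in the target configuration */
    SKETCH_PROP_NOT_SUPPORTED /* not supported in the current state */
} sketch_prop_state_t;

/* Map a raw 32-bit property value to its meaning. */
sketch_prop_state_t sketch_classify_int32(unsigned int value)
{
    if (value == (unsigned int)DCGM_INT32_BLANK)
        return SKETCH_PROP_BLANK;
    if (value == (unsigned int)DCGM_INT32_NOT_SUPPORTED)
        return SKETCH_PROP_NOT_SUPPORTED;
    return SKETCH_PROP_VALUE;
}

/* Count how many per-GPU values in an array are real readings. */
size_t sketch_count_usable(const unsigned int *values, size_t count)
{
    size_t usable = 0;
    size_t i;

    assert(values != NULL || count == 0);
    for (i = 0; i < count; ++i)
    {
        if (sketch_classify_int32(values[i]) == SKETCH_PROP_VALUE)
            ++usable;
    }
    return usable;
}
```

The same pattern would apply to 64-bit, floating-point and string properties with their respective sentinels.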
Manual Invocation
- group DCGMAPI_DC_MI
Describes APIs used to manually enforce the desired configuration on a group of GPUs.
Functions
-
dcgmReturn_t dcgmConfigEnforce(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmStatus_t statusHandle)
Enforces the previously set configuration for all the GPUs present in the group.
This API lets users manually enforce the configuration at any time. The configuration can be enforced only if it has already been set with dcgmConfigSet.
If any property cannot be enforced for any GPU in the group, the API returns an error. Evaluate the status handle statusHandle to access the error attributes of the failed operations; see the status management APIs at Status handling.
- Parameters
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all the GPUs.
statusHandle – IN/OUT: Resulting error status for multiple operations. Pass it as NULL if the detailed error information is not needed. Look at dcgmStatusCreate for details on creating status handle.
- Returns
DCGM_ST_OK if the configuration has been successfully enforced.
DCGM_ST_BADPARAM if groupId is invalid.
DCGM_ST_NOT_CONFIGURED if the target configuration is not already set.
DCGM_ST_GENERIC_ERROR if an unknown error has occurred.
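The ordering rule for dcgmConfigEnforce (a target configuration must exist before it can be enforced) can be illustrated with a toy model. Nothing here calls DCGM: sketch_group_t, the sketch_ functions and the SKETCH_ST_ codes are hypothetical stand-ins whose numeric values are not the real dcgmReturn_t values. The model tracks a single property, lets it drift away from the target, and shows enforcement restoring it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the return codes named above; the numeric
 * values are not the real dcgmReturn_t values. */
typedef enum
{
    SKETCH_ST_OK,
    SKETCH_ST_BADPARAM,
    SKETCH_ST_NOT_CONFIGURED
} sketch_return_t;

/* Toy model of a group that tracks one configurable property. */
typedef struct
{
    int hasTarget;       /* non-zero once a target configuration exists */
    unsigned int target; /* desired value of the property */
    unsigned int actual; /* value currently in effect */
} sketch_group_t;

/* Model of setting the configuration: records the target and applies it. */
sketch_return_t sketch_config_set(sketch_group_t *group, unsigned int value)
{
    if (group == NULL)
        return SKETCH_ST_BADPARAM;
    group->hasTarget = 1;
    group->target    = value;
    group->actual    = value;
    return SKETCH_ST_OK;
}

/* Simulate the actual state drifting away from the target. */
void sketch_drift(sketch_group_t *group, unsigned int value)
{
    assert(group != NULL);
    group->actual = value;
}

/* Model of enforcing: only possible once a target has been set. */
sketch_return_t sketch_config_enforce(sketch_group_t *group)
{
    if (group == NULL)
        return SKETCH_ST_BADPARAM;
    if (!group->hasTarget)
        return SKETCH_ST_NOT_CONFIGURED;
    group->actual = group->target;
    return SKETCH_ST_OK;
}
```

In this model, enforcing before any set reports the not-configured code, which mirrors the DCGM_ST_NOT_CONFIGURED return documented above.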