Configuration

group DCGMAPI_DC

This chapter describes the methods that handle device configuration retrieval and default settings.

The APIs in the Configuration module can be broken down into the following categories:

Setup and Management

group DCGMAPI_DC_Setup

Describes APIs to get and set configuration on a group of GPUs.

Functions

dcgmReturn_t dcgmConfigSet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmConfig_t *pDeviceConfig, dcgmStatus_t statusHandle)

Used to set configuration for the group of one or more GPUs identified by groupId.

The configuration settings specified in pDeviceConfig are applied to all the GPUs in the group. Since a DCGM group is a logical grouping of GPUs, the configuration settings stay intact for the individual GPUs even after the group is destroyed.

To leave one or more properties in the input pDeviceConfig unchanged, set each such property to one of DCGM_INT32_BLANK, DCGM_INT64_BLANK, DCGM_FP64_BLANK or DCGM_STR_BLANK, according to the data type of the property to be ignored.
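The blank-sentinel convention described above can be sketched as follows. This is a minimal illustration, not DCGM code: sketch_config_t is a hypothetical, cut-down stand-in for dcgmConfig_t, its field names are illustrative, and SKETCH_INT32_BLANK is a placeholder for DCGM_INT32_BLANK. Real code should use the types and macros from the DCGM headers.

```c
#include <stdint.h>

/* Placeholder for the 32-bit blank sentinel, used only in this sketch.
 * Real code should use DCGM_INT32_BLANK from the DCGM headers. */
#define SKETCH_INT32_BLANK ((int32_t)0x7ffffff0)

/* Hypothetical, cut-down stand-in for dcgmConfig_t. */
typedef struct {
    unsigned int version;  /* must be populated by the caller */
    int32_t eccMode;
    int32_t computeMode;
    int32_t powerLimit;
} sketch_config_t;

/* Stamp the version and mark every property as "leave unchanged". */
static void sketch_config_blank(sketch_config_t *cfg, unsigned int version)
{
    cfg->version     = version;
    cfg->eccMode     = SKETCH_INT32_BLANK;
    cfg->computeMode = SKETCH_INT32_BLANK;
    cfg->powerLimit  = SKETCH_INT32_BLANK;
}

/* Count the properties that a set call would actually try to change. */
static int sketch_config_count_requested(const sketch_config_t *cfg)
{
    int n = 0;
    n += (cfg->eccMode     != SKETCH_INT32_BLANK);
    n += (cfg->computeMode != SKETCH_INT32_BLANK);
    n += (cfg->powerLimit  != SKETCH_INT32_BLANK);
    return n;
}
```

Blanking everything first and then assigning only the properties of interest keeps a partially initialized structure from changing settings by accident.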

If any of the properties fail to be configured for any of the GPUs in the group, the API returns an error. The status handle statusHandle should then be examined to access the error attributes of the failed operations. Refer to the status management APIs under Status handling for how to access the error attributes.

To find out valid supported clock values that can be passed to dcgmConfigSet, look at the device attributes of a GPU in the group using the API dcgmGetDeviceAttributes.

Parameters
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group.

  • pDeviceConfig – IN: Pointer to memory holding the desired configuration to be applied to all the GPUs in the group represented by groupId. The caller must populate the version field of pDeviceConfig.

  • statusHandle – IN/OUT: Resulting error status for multiple operations. Pass NULL if detailed error information is not needed. See dcgmStatusCreate for details on creating a status handle.

Returns

dcgmReturn_t dcgmConfigGet(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmConfigType_t type, int count, dcgmConfig_t deviceConfigList[], dcgmStatus_t statusHandle)

Used to get configuration for all the GPUs present in the group.

This API can get the most recent target (desired) configuration set by dcgmConfigSet. Set type to DCGM_CONFIG_TARGET_STATE to get the target configuration. The target configuration properties are maintained by DCGM and are automatically enforced after a GPU reset or reinitialization is completed.

The method can also be used to get the actual configuration state of the GPUs in the group. Set type to DCGM_CONFIG_CURRENT_STATE to get the actual configuration state. Ideally, the actual configuration state is exactly the same as the target configuration state.

If any property in the target configuration is unknown, the property value in the output is populated as one of DCGM_INT32_BLANK, DCGM_INT64_BLANK, DCGM_FP64_BLANK or DCGM_STR_BLANK, according to the data type of the property.

If any property in the current configuration state is not supported, the property value in the output is populated as one of DCGM_INT32_NOT_SUPPORTED, DCGM_INT64_NOT_SUPPORTED, DCGM_FP64_NOT_SUPPORTED or DCGM_STR_NOT_SUPPORTED, according to the data type of the property.
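A caller comparing the target state with the current state has to account for both kinds of sentinel. The sketch below shows one possible way to do that for a 32-bit property. The SKETCH_* constants are placeholders for DCGM_INT32_BLANK and DCGM_INT32_NOT_SUPPORTED, the helper names are hypothetical, and the rule that a blank target or an unsupported current value does not count as drift is an assumption of this sketch rather than documented DCGM behavior.

```c
#include <stdint.h>

/* Placeholders used only in this sketch; real code should use
 * DCGM_INT32_BLANK and DCGM_INT32_NOT_SUPPORTED from the DCGM headers. */
#define SKETCH_INT32_BLANK         ((int32_t)0x7ffffff0)
#define SKETCH_INT32_NOT_SUPPORTED ((int32_t)0x7ffffff2)

typedef enum {
    SKETCH_VALUE_SET,           /* an ordinary property value */
    SKETCH_VALUE_BLANK,         /* unknown or never configured */
    SKETCH_VALUE_NOT_SUPPORTED  /* the GPU does not support the property */
} sketch_value_kind_t;

/* Classify a 32-bit property value returned in a configuration entry. */
static sketch_value_kind_t sketch_classify(int32_t value)
{
    if (value == SKETCH_INT32_BLANK)
        return SKETCH_VALUE_BLANK;
    if (value == SKETCH_INT32_NOT_SUPPORTED)
        return SKETCH_VALUE_NOT_SUPPORTED;
    return SKETCH_VALUE_SET;
}

/* Return 1 if the current value differs from a requested target value.
 * A blank target (nothing requested) or a current value that is not an
 * ordinary value (nothing to compare) is not reported as drift. */
static int sketch_has_drifted(int32_t target, int32_t current)
{
    if (sketch_classify(target) != SKETCH_VALUE_SET)
        return 0;
    if (sketch_classify(current) != SKETCH_VALUE_SET)
        return 0;
    return target != current;
}
```

In practice the two inputs would come from two dcgmConfigGet calls on the same group, one with each type value.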

If any of the properties can’t be fetched for any of the GPUs in the group, the API returns an error. The status handle statusHandle should then be examined to access the error attributes of the failed operations. Refer to the status management APIs under Status handling for how to access the error attributes.

Parameters
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group.

  • type – IN: Type of configuration values to be fetched.

  • count – IN: The number of entries that deviceConfigList array can store.

  • deviceConfigList – OUT: Pointer to memory to hold the requested configuration for all the GPUs in the group (groupId). The memory must be large enough to hold output for every GPU present in the group (groupId).

  • statusHandle – IN/OUT: Resulting error status for multiple operations. Pass NULL if detailed error information is not needed. See dcgmStatusCreate for details on creating a status handle.
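The count and deviceConfigList parameters above imply that the caller allocates one entry per GPU in the group before the call. A minimal sketch of that allocation follows. sketch_config_entry_t is a hypothetical stand-in for dcgmConfig_t, and stamping the version on every entry is an assumption carried over from the dcgmConfigSet requirement, not something this section states for dcgmConfigGet.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for dcgmConfig_t. */
typedef struct {
    unsigned int version;
    unsigned int gpuId;
} sketch_config_entry_t;

/* Allocate a zeroed list with one entry per GPU and stamp the version on
 * each entry. Returns NULL if gpu_count is 0 or the allocation fails.
 * The caller owns the list and releases it with free(). */
static sketch_config_entry_t *sketch_alloc_config_list(size_t gpu_count,
                                                       unsigned int version)
{
    if (gpu_count == 0)
        return NULL;

    sketch_config_entry_t *list = calloc(gpu_count, sizeof *list);
    if (list == NULL)
        return NULL;

    for (size_t i = 0; i < gpu_count; ++i)
        list[i].version = version;

    return list;
}
```

The same gpu_count would then be passed as count, so that the array capacity and the value reported to the API cannot disagree.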

Returns

Manual Invocation

group DCGMAPI_DC_MI

Describes APIs used to manually enforce the desired configuration on a group of GPUs.

Functions

dcgmReturn_t dcgmConfigEnforce(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmStatus_t statusHandle)

Used to enforce the previously set configuration for all the GPUs present in the group.

This API lets users manually enforce the configuration at any point in time. The configuration can only be enforced if it has already been set using the API dcgmConfigSet.

If any of the properties can’t be enforced for any of the GPUs in the group, the API returns an error. The status handle statusHandle should then be examined to access the error attributes of the failed operations. Refer to the status management APIs under Status handling for how to access the error attributes.

Parameters
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to perform the operation on all the GPUs.

  • statusHandle – IN/OUT: Resulting error status for multiple operations. Pass NULL if detailed error information is not needed. See dcgmStatusCreate for details on creating a status handle.

Returns