2.2.2. Grouping
[System]
The following APIs are used for group management. The user can create a group of GPUs and then perform an operation on the whole group at once. If grouping is not needed and the user wishes to run commands on all GPUs seen by DCGM, the user can pass DCGM_GROUP_ALL_GPUS wherever a group ID is expected.
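As a rough illustration of the DCGM_GROUP_ALL_GPUS convention, the C sketch below models how a sentinel value can stand in for a group ID. It is a hypothetical toy, not DCGM code. `TOY_ALL_GPUS`, `toy_resolve_targets`, and the GPU and group counts are invented for this example, and DCGM defines its own value for DCGM_GROUP_ALL_GPUS.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_GPUS   4            /* GPUs visible to the toy engine: IDs 0..3 */
#define TOY_MAX_GROUPS 4            /* size of the toy group table */
#define TOY_ALL_GPUS   0xFFFFFFFFu  /* hypothetical sentinel playing the role of DCGM_GROUP_ALL_GPUS */

/* member[grp][gpu] is true if GPU `gpu` belongs to group `grp`. */
static bool member[TOY_MAX_GROUPS][TOY_NUM_GPUS];

/* Writes the GPU IDs targeted by group_id into out[] and returns how many.
   The sentinel selects every GPU without any group having been created. */
unsigned int toy_resolve_targets(unsigned int group_id, unsigned int out[TOY_NUM_GPUS])
{
    unsigned int n = 0;
    for (unsigned int gpu = 0; gpu < TOY_NUM_GPUS; gpu++) {
        bool selected = (group_id == TOY_ALL_GPUS) ||
                        (group_id < TOY_MAX_GROUPS && member[group_id][gpu]);
        if (selected)
            out[n++] = gpu;
    }
    return n;
}

/* Usage: a real group targets only its members; the sentinel targets all GPUs. */
void toy_all_gpus_example(void)
{
    unsigned int ids[TOY_NUM_GPUS];
    member[2][1] = member[2][3] = true;                  /* group 2 = {1, 3} */
    assert(toy_resolve_targets(2, ids) == 2 && ids[0] == 1 && ids[1] == 3);
    assert(toy_resolve_targets(TOY_ALL_GPUS, ids) == TOY_NUM_GPUS);
}
```

In the real API, the sentinel is passed directly as the groupId argument, so no group has to be created or destroyed.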
Functions
- dcgmReturn_t dcgmGroupAddDevice ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, unsigned int gpuId )
- dcgmReturn_t dcgmGroupCreate ( dcgmHandle_t pDcgmHandle, dcgmGroupType_t type, char* groupName, dcgmGpuGrp_t* pDcgmGrpId )
- dcgmReturn_t dcgmGroupDestroy ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId )
- dcgmReturn_t dcgmGroupGetAllIds ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupIdList[], unsigned int* count )
- dcgmReturn_t dcgmGroupGetInfo ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmGroupInfo_t* pDcgmGroupInfo )
- dcgmReturn_t dcgmGroupRemoveDevice ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, unsigned int gpuId )
Functions
- dcgmReturn_t dcgmGroupAddDevice ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, unsigned int gpuId )
Parameters
- pDcgmHandle
- IN : DCGM Handle
- groupId
- IN : Group ID to which the device should be added
- gpuId
- IN : DCGM GPU Id
Returns
- DCGM_ST_OK if the GPU Id has been successfully added to the group
- DCGM_ST_INIT_ERROR if the library has not been successfully initialized
- DCGM_ST_NOT_CONFIGURED if the entry corresponding to the group (groupId) does not exist
- DCGM_ST_BADPARAM if gpuId is invalid or already part of the specified group
Description
Used to add the specified GPU ID to the group represented by groupId.
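The return codes listed for dcgmGroupAddDevice can be illustrated with a small in-memory model. This is a hypothetical sketch, not the DCGM implementation. The `toy_*` names, the status enum (whose members only mirror DCGM_ST_OK, DCGM_ST_BADPARAM, and DCGM_ST_NOT_CONFIGURED), and the table sizes are assumptions, and library initialization (DCGM_ST_INIT_ERROR) is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes mirroring the documented DCGM_ST_* results;
   the names and numeric values are illustrative, not DCGM's. */
typedef enum { TOY_OK, TOY_BADPARAM, TOY_NOT_CONFIGURED } toy_status;

#define TOY_MAX_GROUPS 4  /* toy group-table size */
#define TOY_NUM_GPUS   4  /* valid GPU IDs are 0..3 */

typedef struct {
    bool exists;               /* an entry exists for this group ID */
    bool member[TOY_NUM_GPUS]; /* member[g] is true if GPU g is in the group */
} toy_group;

static toy_group toy_groups[TOY_MAX_GROUPS];

/* Follows the documented return-code contract of dcgmGroupAddDevice. */
toy_status toy_group_add_device(unsigned int group_id, unsigned int gpu_id)
{
    if (group_id >= TOY_MAX_GROUPS || !toy_groups[group_id].exists)
        return TOY_NOT_CONFIGURED;   /* no entry for this group */
    if (gpu_id >= TOY_NUM_GPUS || toy_groups[group_id].member[gpu_id])
        return TOY_BADPARAM;         /* invalid GPU ID, or already a member */
    toy_groups[group_id].member[gpu_id] = true;
    return TOY_OK;
}

/* Usage: each assert records the documented outcome for one case. */
void toy_add_device_example(void)
{
    toy_groups[1].exists = true;                          /* an empty group 1 */
    assert(toy_group_add_device(1, 2) == TOY_OK);
    assert(toy_group_add_device(1, 2) == TOY_BADPARAM);   /* duplicate */
    assert(toy_group_add_device(1, 7) == TOY_BADPARAM);   /* no such GPU */
    assert(toy_group_add_device(3, 0) == TOY_NOT_CONFIGURED);
}
```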
- dcgmReturn_t dcgmGroupCreate ( dcgmHandle_t pDcgmHandle, dcgmGroupType_t type, char* groupName, dcgmGpuGrp_t* pDcgmGrpId )
Parameters
- pDcgmHandle
- IN : DCGM Handle
- type
- IN : Type of GPU group to be formed (for example, DCGM_GROUP_DEFAULT or DCGM_GROUP_EMPTY)
- groupName
- IN : Desired name of the GPU group, specified as a null-terminated C string
- pDcgmGrpId
- OUT : Pointer through which the ID of the newly created group is returned
Returns
- DCGM_ST_OK if the group has been created
- DCGM_ST_BADPARAM if any of type, groupName, the length of groupName, or pDcgmGrpId is invalid
- DCGM_ST_MAX_LIMIT if the number of groups on the system has reached the maximum, DCGM_MAX_NUM_GROUPS
- DCGM_ST_INIT_ERROR if the library has not been successfully initialized
Description
Used to create a GPU group that can hold one or more GPU IDs; the group is identified by the opaque handle returned in pDcgmGrpId. Instead of executing an operation separately for each GPU, the user can execute the same operation on all GPUs in the group with a single API call.
To create the group with all the GPUs present on the system, the type field should be specified as DCGM_GROUP_DEFAULT. To create an empty group, the type field should be specified as DCGM_GROUP_EMPTY. The empty group can be updated with the desired set of GPUs using the APIs dcgmGroupAddDevice and dcgmGroupRemoveDevice.
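The difference between DCGM_GROUP_DEFAULT and DCGM_GROUP_EMPTY, and the DCGM_ST_MAX_LIMIT case, can be sketched with a toy group table. Everything here is hypothetical. `toy_group_create` only mirrors the documented behavior of dcgmGroupCreate, the limit of two groups stands in for DCGM_MAX_NUM_GROUPS, and the name-length limit is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef enum { TOY_OK, TOY_BADPARAM, TOY_MAX_LIMIT } toy_status;  /* illustrative codes */
typedef enum { TOY_GROUP_DEFAULT, TOY_GROUP_EMPTY } toy_group_type;

#define TOY_MAX_GROUPS   2   /* toy stand-in for DCGM_MAX_NUM_GROUPS */
#define TOY_NUM_GPUS     4   /* GPUs visible to the toy engine */
#define TOY_MAX_NAME_LEN 32  /* hypothetical name-length limit, including the terminator */

typedef struct {
    bool exists;
    char name[TOY_MAX_NAME_LEN];
    bool member[TOY_NUM_GPUS];
} toy_group;

static toy_group toy_groups[TOY_MAX_GROUPS];

toy_status toy_group_create(toy_group_type type, const char *name, unsigned int *out_id)
{
    /* Validate arguments first, as in the documented DCGM_ST_BADPARAM case. */
    if (name == NULL || out_id == NULL || strlen(name) >= TOY_MAX_NAME_LEN ||
        (type != TOY_GROUP_DEFAULT && type != TOY_GROUP_EMPTY))
        return TOY_BADPARAM;
    for (unsigned int id = 0; id < TOY_MAX_GROUPS; id++) {
        toy_group *g = &toy_groups[id];
        if (g->exists)
            continue;                  /* slot taken, try the next one */
        g->exists = true;
        strcpy(g->name, name);         /* length was checked above */
        for (unsigned int gpu = 0; gpu < TOY_NUM_GPUS; gpu++)
            g->member[gpu] = (type == TOY_GROUP_DEFAULT);  /* DEFAULT: all GPUs; EMPTY: none */
        *out_id = id;
        return TOY_OK;
    }
    return TOY_MAX_LIMIT;              /* every slot is in use */
}

void toy_create_example(void)
{
    unsigned int all_id, empty_id, extra_id;
    assert(toy_group_create(TOY_GROUP_DEFAULT, "all", &all_id) == TOY_OK);
    assert(toy_group_create(TOY_GROUP_EMPTY, "mine", &empty_id) == TOY_OK);
    assert(toy_groups[all_id].member[3] && !toy_groups[empty_id].member[3]);
    assert(toy_group_create(TOY_GROUP_EMPTY, "extra", &extra_id) == TOY_MAX_LIMIT);
}
```

An empty group created this way would then be populated one GPU at a time, which is the role the text assigns to dcgmGroupAddDevice and dcgmGroupRemoveDevice.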
- dcgmReturn_t dcgmGroupDestroy ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId )
Parameters
- pDcgmHandle
- IN : DCGM Handle
- groupId
- IN : Group ID
Returns
- DCGM_ST_OK if the group has been destroyed
- DCGM_ST_BADPARAM if groupId is invalid
- DCGM_ST_INIT_ERROR if the library has not been successfully initialized
- DCGM_ST_NOT_CONFIGURED if the entry corresponding to the group does not exist
Description
Used to destroy the group represented by groupId. Because a DCGM group is only a logical grouping of GPUs, properties applied to the group remain in effect on the individual GPUs even after the group is destroyed.
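The point that destroying a group leaves per-GPU properties intact can be modeled by storing settings on the GPUs rather than on the group. This toy sketch is illustrative only, not DCGM code. `toy_group_apply_setting` and the single per-GPU integer setting are hypothetical. The toy treats an out-of-range ID as invalid (BADPARAM) and an unused in-range ID as missing (NOT_CONFIGURED), which is one plausible reading of the documented return codes.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_OK, TOY_BADPARAM, TOY_NOT_CONFIGURED } toy_status;  /* illustrative codes */

#define TOY_MAX_GROUPS 4
#define TOY_NUM_GPUS   4

typedef struct {
    bool exists;
    bool member[TOY_NUM_GPUS];
} toy_group;

static toy_group toy_groups[TOY_MAX_GROUPS];
/* Hypothetical per-GPU setting, stored on the GPU itself rather than on any group. */
static unsigned int toy_gpu_setting[TOY_NUM_GPUS];

/* Applying a setting to a group writes it to each member GPU. */
toy_status toy_group_apply_setting(unsigned int group_id, unsigned int value)
{
    if (group_id >= TOY_MAX_GROUPS || !toy_groups[group_id].exists)
        return TOY_NOT_CONFIGURED;
    for (unsigned int gpu = 0; gpu < TOY_NUM_GPUS; gpu++)
        if (toy_groups[group_id].member[gpu])
            toy_gpu_setting[gpu] = value;
    return TOY_OK;
}

/* Destroying a group deletes only the grouping, not the per-GPU settings. */
toy_status toy_group_destroy(unsigned int group_id)
{
    if (group_id >= TOY_MAX_GROUPS)
        return TOY_BADPARAM;                 /* not a valid group ID at all */
    if (!toy_groups[group_id].exists)
        return TOY_NOT_CONFIGURED;           /* valid ID, but no such group */
    toy_groups[group_id] = (toy_group){0};   /* clear the group entry only */
    return TOY_OK;
}

void toy_destroy_example(void)
{
    toy_groups[0].exists = true;
    toy_groups[0].member[1] = true;                       /* group 0 = {1} */
    assert(toy_group_apply_setting(0, 250) == TOY_OK);
    assert(toy_group_destroy(0) == TOY_OK);
    assert(toy_gpu_setting[1] == 250);                    /* setting survives the group */
    assert(toy_group_destroy(0) == TOY_NOT_CONFIGURED);   /* already gone */
}
```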
- dcgmReturn_t dcgmGroupGetAllIds ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupIdList[], unsigned int* count )
Parameters
- pDcgmHandle
- IN : DCGM Handle
- groupIdList
- OUT : List of Group Ids
- count
- OUT : The number of Group ids in the list
Returns
- DCGM_ST_OK if the ids of the groups were successfully retrieved
- DCGM_ST_BADPARAM if either of the groupIdList or count is null
- DCGM_ST_GENERIC_ERROR if an unknown error has occurred
Description
Used to get the IDs of all GPU groups. The group IDs are returned in groupIdList, and the number of IDs returned is stored in count. The caller must allocate groupIdList with room for at least DCGM_MAX_NUM_GROUPS entries.
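The buffer-sizing rule for dcgmGroupGetAllIds follows from the caller not knowing in advance how many groups exist, so the list must be sized for the maximum. A toy model of that pattern follows. `toy_group_get_all_ids` and `TOY_MAX_GROUPS` are hypothetical stand-ins for dcgmGroupGetAllIds and DCGM_MAX_NUM_GROUPS, and plain `unsigned int` replaces dcgmGpuGrp_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_OK, TOY_BADPARAM } toy_status;  /* illustrative codes */

#define TOY_MAX_GROUPS 8   /* toy stand-in for DCGM_MAX_NUM_GROUPS */

static bool toy_group_exists[TOY_MAX_GROUPS];

/* Fills id_list with the IDs of existing groups and reports how many in *count.
   id_list must have room for TOY_MAX_GROUPS entries. */
toy_status toy_group_get_all_ids(unsigned int id_list[], unsigned int *count)
{
    if (id_list == NULL || count == NULL)
        return TOY_BADPARAM;
    unsigned int n = 0;
    for (unsigned int id = 0; id < TOY_MAX_GROUPS; id++)
        if (toy_group_exists[id])
            id_list[n++] = id;
    *count = n;
    return TOY_OK;
}

void toy_get_all_ids_example(void)
{
    unsigned int ids[TOY_MAX_GROUPS];   /* sized for the maximum, not the current count */
    unsigned int count = 0;
    toy_group_exists[2] = toy_group_exists[5] = true;
    assert(toy_group_get_all_ids(ids, &count) == TOY_OK);
    assert(count == 2 && ids[0] == 2 && ids[1] == 5);
}
```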
- dcgmReturn_t dcgmGroupGetInfo ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmGroupInfo_t* pDcgmGroupInfo )
Parameters
- pDcgmHandle
- IN : DCGM Handle
- groupId
- IN : Group ID for which information is to be fetched
- pDcgmGroupInfo
- OUT : Group Information
Returns
- DCGM_ST_OK if the group info is successfully retrieved
- DCGM_ST_BADPARAM if either groupId or pDcgmGroupInfo is invalid
- DCGM_ST_INIT_ERROR if the library has not been successfully initialized
- DCGM_ST_MAX_LIMIT if the group does not contain the GPU
- DCGM_ST_NOT_CONFIGURED if the entry corresponding to the group (groupId) does not exist
Description
Used to get information about the group represented by groupId. The information returned in pDcgmGroupInfo consists of the group name and the list of GPU IDs present in the group.
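What dcgmGroupGetInfo returns can be pictured with a toy record holding a name and a list of member GPU IDs. The `toy_group_info` struct and `toy_group_get_info` function are hypothetical. The real dcgmGroupInfo_t layout, including any version field it requires, is defined in the DCGM headers and is not reproduced here. DCGM_ST_MAX_LIMIT and DCGM_ST_INIT_ERROR are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOY_OK, TOY_BADPARAM, TOY_NOT_CONFIGURED } toy_status;  /* illustrative codes */

#define TOY_MAX_GROUPS   4
#define TOY_NUM_GPUS     4
#define TOY_MAX_NAME_LEN 32

typedef struct {
    bool exists;
    char name[TOY_MAX_NAME_LEN];
    bool member[TOY_NUM_GPUS];
} toy_group;

/* Hypothetical output record: the group name plus the member GPU IDs. */
typedef struct {
    char name[TOY_MAX_NAME_LEN];
    unsigned int count;                  /* number of valid entries in gpu_ids */
    unsigned int gpu_ids[TOY_NUM_GPUS];
} toy_group_info;

static toy_group toy_groups[TOY_MAX_GROUPS];

toy_status toy_group_get_info(unsigned int group_id, toy_group_info *info)
{
    if (info == NULL || group_id >= TOY_MAX_GROUPS)
        return TOY_BADPARAM;
    const toy_group *g = &toy_groups[group_id];
    if (!g->exists)
        return TOY_NOT_CONFIGURED;
    memcpy(info->name, g->name, sizeof info->name);
    info->count = 0;
    for (unsigned int gpu = 0; gpu < TOY_NUM_GPUS; gpu++)
        if (g->member[gpu])
            info->gpu_ids[info->count++] = gpu;   /* collect members in ID order */
    return TOY_OK;
}

void toy_get_info_example(void)
{
    toy_group_info info;
    toy_groups[1] = (toy_group){ .exists = true, .name = "train",
                                 .member = { [0] = true, [2] = true } };
    assert(toy_group_get_info(1, &info) == TOY_OK);
    assert(strcmp(info.name, "train") == 0);
    assert(info.count == 2 && info.gpu_ids[0] == 0 && info.gpu_ids[1] == 2);
    assert(toy_group_get_info(0, &info) == TOY_NOT_CONFIGURED);
}
```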
- dcgmReturn_t dcgmGroupRemoveDevice ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, unsigned int gpuId )
Parameters
- pDcgmHandle
- IN : DCGM Handle
- groupId
- IN : Group ID from which the device should be removed
- gpuId
- IN : DCGM GPU Id
Returns
- DCGM_ST_OK if the GPU Id has been successfully removed from the group
- DCGM_ST_INIT_ERROR if the library has not been successfully initialized
- DCGM_ST_NOT_CONFIGURED if the entry corresponding to the group (groupId) does not exist
- DCGM_ST_BADPARAM if gpuId is invalid or not part of the specified group
Description
Used to remove the specified GPU ID from the group represented by groupId.
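dcgmGroupRemoveDevice is the inverse of dcgmGroupAddDevice, and its documented return codes can be illustrated with a small in-memory model. This is a hypothetical sketch, not DCGM code. The `toy_*` names and table sizes are assumptions, and DCGM_ST_INIT_ERROR is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_OK, TOY_BADPARAM, TOY_NOT_CONFIGURED } toy_status;  /* illustrative codes */

#define TOY_MAX_GROUPS 4
#define TOY_NUM_GPUS   4  /* valid GPU IDs are 0..3 */

typedef struct {
    bool exists;               /* an entry exists for this group ID */
    bool member[TOY_NUM_GPUS]; /* member[g] is true if GPU g is in the group */
} toy_group;

static toy_group toy_groups[TOY_MAX_GROUPS];

/* Follows the documented return-code contract of dcgmGroupRemoveDevice. */
toy_status toy_group_remove_device(unsigned int group_id, unsigned int gpu_id)
{
    if (group_id >= TOY_MAX_GROUPS || !toy_groups[group_id].exists)
        return TOY_NOT_CONFIGURED;   /* no entry for this group */
    if (gpu_id >= TOY_NUM_GPUS || !toy_groups[group_id].member[gpu_id])
        return TOY_BADPARAM;         /* invalid GPU ID, or not a member */
    toy_groups[group_id].member[gpu_id] = false;
    return TOY_OK;
}

void toy_remove_device_example(void)
{
    toy_groups[0] = (toy_group){ .exists = true, .member = { [1] = true } };
    assert(toy_group_remove_device(0, 1) == TOY_OK);
    assert(toy_group_remove_device(0, 1) == TOY_BADPARAM);   /* already removed */
    assert(toy_group_remove_device(2, 1) == TOY_NOT_CONFIGURED);
}
```

In this toy, removing the last GPU leaves an empty group that still exists; only a destroy call would delete the entry.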