System
- group DCGMAPI_SYS
This chapter describes the APIs used to identify the set of GPUs on the node, the grouping functions that provide a mechanism to operate on a group of GPUs, and the status management APIs used to get the individual status of each operation.
The APIs in the System module can be broken down into the following categories:
Discovery
- group DCGM_DISCOVERY
The following APIs are used to discover GPUs and their attributes on a Node.
Functions
-
dcgmReturn_t dcgmGetAllDevices(dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[32], int *count)
This method is used to get identifiers corresponding to all the devices on the system.
The identifier represents DCGM GPU Id corresponding to each GPU on the system and is immutable during the lifespan of the engine. The list should be queried again if the engine is restarted.
The GPUs returned from this function include gpuIds of GPUs that are not supported by DCGM. To only get gpuIds of GPUs that are supported by DCGM, use dcgmGetAllSupportedDevices().
- Parameters
pDcgmHandle – IN: DCGM Handle
gpuIdList – OUT: Array reference to fill GPU Ids present on the system.
count – OUT: Number of GPUs returned in gpuIdList.
- Returns
DCGM_ST_OK if the call was successful.
DCGM_ST_BADPARAM if gpuIdList or count were not valid.
-
dcgmReturn_t dcgmGetAllSupportedDevices(dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[32], int *count)
This method is used to get identifiers corresponding to all the DCGM-supported devices on the system.
The identifier represents DCGM GPU Id corresponding to each GPU on the system and is immutable during the lifespan of the engine. The list should be queried again if the engine is restarted.
The gpuIds returned from this function include only GPUs that are supported by DCGM. To get gpuIds of all GPUs in the system, use dcgmGetAllDevices().
- Parameters
pDcgmHandle – IN: DCGM Handle
gpuIdList – OUT: Array reference to fill GPU Ids present on the system.
count – OUT: Number of GPUs returned in gpuIdList.
- Returns
DCGM_ST_OK if the call was successful.
DCGM_ST_BADPARAM if gpuIdList or count were not valid.
-
dcgmReturn_t dcgmGetDeviceAttributes(dcgmHandle_t pDcgmHandle, unsigned int gpuId, dcgmDeviceAttributes_t *pDcgmAttr)
Gets device attributes corresponding to the gpuId.
If operation is not successful for any of the requested fields then the field is populated with one of DCGM_BLANK_VALUES defined in dcgm_structs.h.
- Parameters
pDcgmHandle – IN: DCGM Handle
gpuId – IN: GPU Id corresponding to which the attributes should be fetched
pDcgmAttr – IN/OUT: Device attributes corresponding to gpuId. pDcgmAttr->version should be set to dcgmDeviceAttributes_version before this call.
- Returns
DCGM_ST_OK if the call was successful.
DCGM_ST_VER_MISMATCH if pDcgmAttr->version is not set or is invalid.
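The version requirement above can be illustrated with a minimal, self-contained sketch. It does not use the DCGM library: `fake_attributes_t`, `fake_get_attributes`, `FAKE_MAKE_VERSION`, and the `ST_*` values are hypothetical stand-ins, and the size-plus-revision encoding is an assumption made for illustration (the real definitions live in dcgm_structs.h). The point is only the calling pattern: zero the struct, stamp the version, then call.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for dcgmReturn_t values; not the real constants. */
typedef enum { ST_OK = 0, ST_BADPARAM, ST_VER_MISMATCH } fake_status_t;

/* Hypothetical attribute struct with a caller-stamped version field. */
typedef struct {
    unsigned int version;
    char name[32];
    unsigned int memoryMb;
} fake_attributes_t;

/* Assumed encoding: struct size in the low 24 bits, revision above it. */
#define FAKE_MAKE_VERSION(type, rev) \
    ((unsigned int)(sizeof(type) | ((unsigned long)(rev) << 24)))
#define FAKE_ATTRIBUTES_VERSION FAKE_MAKE_VERSION(fake_attributes_t, 1)

/* Stand-in for the library call: rejects structs whose version is unset. */
static fake_status_t fake_get_attributes(unsigned int gpuId, fake_attributes_t *attr)
{
    if (attr == NULL)
        return ST_BADPARAM;
    if (attr->version != FAKE_ATTRIBUTES_VERSION)
        return ST_VER_MISMATCH;
    strncpy(attr->name, "example-gpu", sizeof(attr->name) - 1);
    attr->name[sizeof(attr->name) - 1] = '\0';
    attr->memoryMb = 1024u * (gpuId + 1u); /* made-up value */
    return ST_OK;
}

/* Caller-side pattern: clear the struct, stamp the version, then query. */
static fake_status_t query_attributes(unsigned int gpuId, fake_attributes_t *attr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->version = FAKE_ATTRIBUTES_VERSION;
    return fake_get_attributes(gpuId, attr);
}
```

Stamping the version in one helper, as `query_attributes` does here, keeps the version-mismatch failure mode out of every call site.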
-
dcgmReturn_t dcgmGetEntityGroupEntities(dcgmHandle_t dcgmHandle, dcgm_field_entity_group_t entityGroup, dcgm_field_eid_t *entities, int *numEntities, unsigned int flags)
Gets the list of entities that exist for a given entity group.
This API can be used in place of dcgmGetAllDevices.
- Parameters
dcgmHandle – IN: DCGM Handle
entityGroup – IN: Entity group to list entities of
entities – OUT: Array of entities for entityGroup
numEntities – IN/OUT: Upon calling, this should be the number of entities that entities[] can hold. Upon return, this will contain the number of entities actually saved to entities.
flags – IN: Flags to modify the behavior of this request. See DCGM_GEGE_FLAG_* #defines in dcgm_structs.h
- Returns
DCGM_ST_OK if the call was successful.
DCGM_ST_INSUFFICIENT_SIZE if numEntities was not large enough to hold the number of entities in the entityGroup. numEntities will contain the capacity needed to complete this request successfully.
DCGM_ST_NOT_SUPPORTED if the given entityGroup does not support enumeration.
DCGM_ST_BADPARAM if any parameter is invalid
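The capacity-in, count-out contract of numEntities lends itself to a call-then-retry pattern: try with an initial buffer and, if the call reports insufficient size, grow the buffer to the reported capacity and call again. The sketch below shows that pattern without the DCGM library so that it compiles on its own; `fake_get_entities`, `list_entities`, the `ST_*` values, and the entity ids are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for dcgmReturn_t values; not the real constants. */
typedef enum { ST_OK = 0, ST_BADPARAM, ST_INSUFFICIENT_SIZE, ST_NO_MEMORY } fake_status_t;

/* Stand-in that mimics the numEntities contract: on entry *numEntities is
 * the capacity of entities[]; on return it is the number of entities
 * written, or the capacity required when the buffer was too small. */
static fake_status_t fake_get_entities(unsigned int *entities, int *numEntities)
{
    static const unsigned int present[] = { 0, 1, 2, 3, 4 }; /* made-up ids */
    const int total = (int)(sizeof(present) / sizeof(present[0]));

    if (numEntities == NULL || (entities == NULL && *numEntities > 0))
        return ST_BADPARAM;
    if (*numEntities < total) {
        *numEntities = total;
        return ST_INSUFFICIENT_SIZE;
    }
    for (int i = 0; i < total; i++)
        entities[i] = present[i];
    *numEntities = total;
    return ST_OK;
}

/* Calls the enumerator with an initial guess and retries once with the
 * reported capacity. On ST_OK the caller owns *out and must free() it. */
static fake_status_t list_entities(int initialCapacity, unsigned int **out, int *count)
{
    assert(out != NULL && count != NULL && initialCapacity > 0);

    int capacity = initialCapacity;
    unsigned int *buf = malloc((size_t)capacity * sizeof(*buf));
    if (buf == NULL)
        return ST_NO_MEMORY;

    fake_status_t st = fake_get_entities(buf, &capacity);
    if (st == ST_INSUFFICIENT_SIZE) {
        /* capacity now holds the size needed; grow the buffer and retry. */
        unsigned int *grown = realloc(buf, (size_t)capacity * sizeof(*grown));
        if (grown == NULL) {
            free(buf);
            return ST_NO_MEMORY;
        }
        buf = grown;
        st = fake_get_entities(buf, &capacity);
    }
    if (st != ST_OK) {
        free(buf);
        return st;
    }
    *out = buf;
    *count = capacity;
    return ST_OK;
}
```

A single retry is enough in this sketch because the stand-in's entity list never changes; code talking to a live system might prefer to loop in case the set of entities changes between the two calls.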
-
dcgmReturn_t dcgmGetGpuInstanceHierarchy(dcgmHandle_t dcgmHandle, dcgmMigHierarchy_v2 *hierarchy)
Gets the hierarchy of GPUs, GPU Instances, and Compute Instances by populating a list of each entity with a reference to their parent.
- Parameters
dcgmHandle – IN: DCGM Handle
hierarchy – IN/OUT: Structure to be populated with the GPUs, GPU Instances, and Compute Instances in the hierarchy. Its version field must be set correctly before the call; otherwise DCGM_ST_VER_MISMATCH is returned.
- Returns
DCGM_ST_OK if the call was successful.
DCGM_ST_VER_MISMATCH if the struct version is incorrect
DCGM_ST_BADPARAM if any parameter is invalid
-
dcgmReturn_t dcgmGetNvLinkLinkStatus(dcgmHandle_t dcgmHandle, dcgmNvLinkStatus_v3 *linkStatus)
Get the NvLink link status for every NvLink in this system.
This includes the NvLinks of both GPUs and NvSwitches. Note that only NvSwitches and GPUs that are visible to the current environment will be returned in this structure.
- Parameters
dcgmHandle – IN: DCGM Handle
linkStatus – IN/OUT: Structure in which to store NvLink link statuses. .version should be set to dcgmNvLinkStatus_version3, the version constant matching dcgmNvLinkStatus_v3, before calling this.
- Returns
DCGM_ST_OK if the call was successful.
DCGM_ST_NOT_SUPPORTED if the given entityGroup does not support enumeration.
DCGM_ST_BADPARAM if any parameter is invalid
Grouping
- group DCGM_GROUPING
The following APIs are used for group management.
The user can create a group of entities and perform an operation on a group of entities. If grouping is not needed and the user wishes to run commands on all GPUs seen by DCGM then the user can use DCGM_GROUP_ALL_GPUS or DCGM_GROUP_ALL_NVSWITCHES in place of group IDs when needed.
Functions
-
dcgmReturn_t dcgmGroupCreate(dcgmHandle_t pDcgmHandle, dcgmGroupType_t type, const char *groupName, dcgmGpuGrp_t *pDcgmGrpId)
Used to create an entity group handle which can store one or more entity Ids as an opaque handle returned in pDcgmGrpId.
Instead of executing an operation separately for each entity, the DCGM group enables the user to execute the same operation on all the entities present in the group as a single API call.
To create the group with all the entities present on the system, the type field should be specified as DCGM_GROUP_DEFAULT or DCGM_GROUP_ALL_NVSWITCHES. To create an empty group, the type field should be specified as DCGM_GROUP_EMPTY. The empty group can be updated with the desired set of entities using the APIs dcgmGroupAddDevice, dcgmGroupAddEntity, dcgmGroupRemoveDevice, and dcgmGroupRemoveEntity.
- Parameters
pDcgmHandle – IN: DCGM Handle
type – IN: Type of Entity Group to be formed
groupName – IN: Desired name of the GPU group, specified as a null-terminated C string
pDcgmGrpId – OUT: Reference to group ID
- Returns
DCGM_ST_OK if the group has been created
DCGM_ST_BADPARAM if any of type, groupName, or pDcgmGrpId is invalid
DCGM_ST_MAX_LIMIT if number of groups on the system has reached the max limit DCGM_MAX_NUM_GROUPS
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
-
dcgmReturn_t dcgmGroupDestroy(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId)
Used to destroy a group represented by groupId.
Since DCGM group is a logical grouping of entities, the properties applied on the group stay intact for the individual entities even after the group is destroyed.
- Parameters
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID
- Returns
DCGM_ST_OK if the group has been destroyed
DCGM_ST_BADPARAM if groupId is invalid
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
DCGM_ST_NOT_CONFIGURED if the entry corresponding to the group does not exist
-
dcgmReturn_t dcgmGroupAddDevice(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, unsigned int gpuId)
Used to add specified GPU Id to the group represented by groupId.
- Parameters
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group Id to which device should be added
gpuId – IN: DCGM GPU Id
- Returns
DCGM_ST_OK if the GPU Id has been successfully added to the group
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
DCGM_ST_NOT_CONFIGURED if the entry corresponding to the group (groupId) does not exist
DCGM_ST_BADPARAM if gpuId is invalid or already part of the specified group
-
dcgmReturn_t dcgmGroupAddEntity(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgm_field_entity_group_t entityGroupId, dcgm_field_eid_t entityId)
Used to add specified entity to the group represented by groupId.
- Parameters
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group Id to which the entity should be added
entityGroupId – IN: Entity group that entityId belongs to
entityId – IN: DCGM entityId
- Returns
DCGM_ST_OK if the entity has been successfully added to the group
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
DCGM_ST_NOT_CONFIGURED if the entry corresponding to the group (groupId) does not exist
DCGM_ST_BADPARAM if entityId is invalid or already part of the specified group
-
dcgmReturn_t dcgmGroupRemoveDevice(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, unsigned int gpuId)
Used to remove specified GPU Id from the group represented by groupId.
- Parameters
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID from which device should be removed
gpuId – IN: DCGM GPU Id
- Returns
DCGM_ST_OK if the GPU Id has been successfully removed from the group
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
DCGM_ST_NOT_CONFIGURED if the entry corresponding to the group (groupId) does not exist
DCGM_ST_BADPARAM if gpuId is invalid or not part of the specified group
-
dcgmReturn_t dcgmGroupRemoveEntity(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgm_field_entity_group_t entityGroupId, dcgm_field_eid_t entityId)
Used to remove specified entity from the group represented by groupId.
- Parameters
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID from which the entity should be removed
entityGroupId – IN: Entity group that entityId belongs to
entityId – IN: DCGM entityId
- Returns
DCGM_ST_OK if the entity has been successfully removed from the group
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
DCGM_ST_NOT_CONFIGURED if the entry corresponding to the group (groupId) does not exist
DCGM_ST_BADPARAM if entityId is invalid or not part of the specified group
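The add and remove return codes above imply set semantics: adding a member that is already present, or removing one that is absent, is reported as a bad parameter. The following sketch models those semantics with a small in-memory set so that it compiles without the DCGM library. `fake_group_t`, the `fake_group_*` functions, the `ST_*` values, and the capacity limit are hypothetical stand-ins, not the real group implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for dcgmReturn_t values; not the real constants. */
typedef enum { ST_OK = 0, ST_BADPARAM, ST_MAX_LIMIT } fake_status_t;

#define FAKE_GROUP_MAX 16 /* arbitrary capacity chosen for this sketch */

/* Hypothetical model of a group: an unordered set of entity ids. */
typedef struct {
    unsigned int ids[FAKE_GROUP_MAX];
    unsigned int count;
} fake_group_t;

/* Starts from an empty group, like creating a group of the empty type. */
static void fake_group_init(fake_group_t *g)
{
    assert(g != NULL);
    g->count = 0;
}

/* Returns the index of id in the group, or -1 if it is not a member. */
static int fake_group_find(const fake_group_t *g, unsigned int id)
{
    for (unsigned int i = 0; i < g->count; i++)
        if (g->ids[i] == id)
            return (int)i;
    return -1;
}

/* Adding a member that is already present is rejected. */
static fake_status_t fake_group_add(fake_group_t *g, unsigned int id)
{
    if (g == NULL || fake_group_find(g, id) >= 0)
        return ST_BADPARAM;
    if (g->count == FAKE_GROUP_MAX)
        return ST_MAX_LIMIT;
    g->ids[g->count++] = id;
    return ST_OK;
}

/* Removing a non-member is rejected; the last element fills the gap,
 * so member order is not preserved. */
static fake_status_t fake_group_remove(fake_group_t *g, unsigned int id)
{
    if (g == NULL)
        return ST_BADPARAM;
    int idx = fake_group_find(g, id);
    if (idx < 0)
        return ST_BADPARAM;
    g->ids[idx] = g->ids[--g->count];
    return ST_OK;
}
```

Because duplicates and missing members are errors rather than no-ops, a caller that wants idempotent behavior would need to treat the bad-parameter result of an add or remove as non-fatal, or check membership first.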
-
dcgmReturn_t dcgmGroupGetInfo(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmGroupInfo_t *pDcgmGroupInfo)
Used to get information corresponding to the group represented by groupId.
The information returned in pDcgmGroupInfo consists of the group name and the list of entities present in the group.
- Parameters
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID for which information is to be fetched
pDcgmGroupInfo – OUT: Group Information
- Returns
DCGM_ST_OK if the group info is successfully received.
DCGM_ST_BADPARAM if any of groupId or pDcgmGroupInfo is invalid.
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_MAX_LIMIT if the group does not contain the GPU
DCGM_ST_NOT_CONFIGURED if the entry corresponding to the group (groupId) does not exist
-
dcgmReturn_t dcgmGroupGetAllIds(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupIdList[], unsigned int *count)
Used to get the Ids of all groups of entities.
The information returned is a list of group Ids in groupIdList and the number of Ids in count. The caller must allocate groupIdList with room for DCGM_MAX_NUM_GROUPS entries.
- Parameters
pDcgmHandle – IN: DCGM Handle
groupIdList – OUT: List of Group Ids
count – OUT: The number of Group ids in the list
- Returns
DCGM_ST_OK if the ids of the groups were successfully retrieved
DCGM_ST_BADPARAM if either groupIdList or count is null
DCGM_ST_GENERIC_ERROR if an unknown error has occurred
Field Grouping
- group DCGM_FIELD_GROUPING
The following APIs are used for field group management.
The user can create a group of fields and perform an operation on a group of fields at once.
Functions
-
dcgmReturn_t dcgmFieldGroupCreate(dcgmHandle_t dcgmHandle, int numFieldIds, unsigned short *fieldIds, const char *fieldGroupName, dcgmFieldGrp_t *dcgmFieldGroupId)
Used to create a group of fields and return the handle in dcgmFieldGroupId.
- Parameters
dcgmHandle – IN: DCGM handle
numFieldIds – IN: Number of field IDs that are being provided in fieldIds[]. Must be between 1 and DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP.
fieldIds – IN: Field IDs to be added to the newly-created field group
fieldGroupName – IN: Unique name for this group of fields. This must not be the same as any existing field groups.
dcgmFieldGroupId – OUT: Handle to the newly-created field group
- Returns
DCGM_ST_OK if the field group was successfully created.
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_MAX_LIMIT if too many field groups already exist
-
dcgmReturn_t dcgmFieldGroupDestroy(dcgmHandle_t dcgmHandle, dcgmFieldGrp_t dcgmFieldGroupId)
Used to remove a field group that was created with dcgmFieldGroupCreate.
- Parameters
dcgmHandle – IN: DCGM handle
dcgmFieldGroupId – IN: Field group to remove
- Returns
DCGM_ST_OK if the field group was successfully removed
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
-
dcgmReturn_t dcgmFieldGroupGetInfo(dcgmHandle_t dcgmHandle, dcgmFieldGroupInfo_t *fieldGroupInfo)
Used to get information about a field group that was created with dcgmFieldGroupCreate.
- Parameters
dcgmHandle – IN: DCGM handle
fieldGroupInfo – IN/OUT: Info about the field group being queried. .version should be set to dcgmFieldGroupInfo_version before this call. .fieldGroupId should contain the fieldGroupId you are interested in querying information for.
- Returns
DCGM_ST_OK if the field group info was returned successfully
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_VER_MISMATCH if .version is not set or is invalid.
-
dcgmReturn_t dcgmFieldGroupGetAll(dcgmHandle_t dcgmHandle, dcgmAllFieldGroup_t *allGroupInfo)
Used to get information about all field groups in the system.
- Parameters
dcgmHandle – IN: DCGM handle
allGroupInfo – IN/OUT: Info about all of the field groups that exist. .version should be set to dcgmAllFieldGroup_version before this call.
- Returns
DCGM_ST_OK if the field group info was successfully returned
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_VER_MISMATCH if .version is not set or is invalid.
Status Handling
- group DCGMAPI_ST
The following APIs are used to manage statuses for multiple operations on one or more GPUs.
Functions
-
dcgmReturn_t dcgmStatusCreate(dcgmStatus_t *statusHandle)
Creates a reference to a DCGM status handle, which can be used to get the statuses for multiple operations on one or more devices.
The multiple statuses are useful when the operations are performed at group level. The status handle provides a mechanism to access error attributes for the failed operations.
The number of errors stored behind the opaque handle can be accessed using the API dcgmStatusGetCount. The errors are accessed from the opaque handle statusHandle using the API dcgmStatusPopError. The user can invoke dcgmStatusPopError for the number of errors or until all the errors are fetched.
When the status handle is not required any further then it should be deleted using the API dcgmStatusDestroy.
- Parameters
statusHandle – OUT: Reference to handle for list of statuses
- Returns
DCGM_ST_OK if the status handle is successfully created
DCGM_ST_BADPARAM if statusHandle is invalid
-
dcgmReturn_t dcgmStatusDestroy(dcgmStatus_t statusHandle)
Used to destroy status handle created using dcgmStatusCreate.
- Parameters
statusHandle – IN: Handle to list of statuses
- Returns
DCGM_ST_OK if the status handle is successfully destroyed
DCGM_ST_BADPARAM if statusHandle is invalid
-
dcgmReturn_t dcgmStatusGetCount(dcgmStatus_t statusHandle, unsigned int *count)
Used to get count of error entries stored inside the opaque handle statusHandle.
- Parameters
statusHandle – IN: Handle to list of statuses
count – OUT: Number of error entries present in the list of statuses
- Returns
DCGM_ST_OK if the error count is successfully received
DCGM_ST_BADPARAM if any of statusHandle or count is invalid
-
dcgmReturn_t dcgmStatusPopError(dcgmStatus_t statusHandle, dcgmErrorInfo_t *pDcgmErrorInfo)
Used to iterate through the list of errors maintained behind statusHandle.
The method pops the first error from the list of DCGM statuses. In order to iterate through all the errors, the user can invoke this API for the number of errors or until all the errors are fetched.
- Parameters
statusHandle – IN: Handle to list of statuses
pDcgmErrorInfo – OUT: First error from the list of statuses
- Returns
DCGM_ST_OK if the error entry is successfully fetched
DCGM_ST_BADPARAM if any of statusHandle or pDcgmErrorInfo is invalid
DCGM_ST_NO_DATA if the status handle list is empty
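The pop-until-empty iteration described above can be sketched as a simple drain loop. To stay self-contained, the example models the opaque status handle as a small FIFO rather than calling the DCGM library. `fake_status_list_t`, `fake_error_t`, the `fake_*` functions, and the `ST_*` values are hypothetical stand-ins, and the real dcgmErrorInfo_t layout is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for dcgmReturn_t values; not the real constants. */
typedef enum { ST_OK = 0, ST_BADPARAM, ST_NO_DATA } fake_status_t;

/* Hypothetical error record. */
typedef struct {
    unsigned int gpuId;
    int status;
} fake_error_t;

#define FAKE_MAX_ERRORS 8 /* arbitrary capacity chosen for this sketch */

/* Hypothetical model of the status handle: a ring-buffer FIFO of errors. */
typedef struct {
    fake_error_t items[FAKE_MAX_ERRORS];
    unsigned int head;
    unsigned int count;
} fake_status_list_t;

static void fake_status_init(fake_status_list_t *l)
{
    assert(l != NULL);
    l->head = 0;
    l->count = 0;
}

/* Test helper that queues an error, as a failed group operation might. */
static fake_status_t fake_push_error(fake_status_list_t *l, fake_error_t e)
{
    if (l == NULL || l->count == FAKE_MAX_ERRORS)
        return ST_BADPARAM;
    l->items[(l->head + l->count) % FAKE_MAX_ERRORS] = e;
    l->count++;
    return ST_OK;
}

static fake_status_t fake_get_count(const fake_status_list_t *l, unsigned int *count)
{
    if (l == NULL || count == NULL)
        return ST_BADPARAM;
    *count = l->count;
    return ST_OK;
}

/* Pops the first (oldest) error; reports ST_NO_DATA when the list is empty. */
static fake_status_t fake_pop_error(fake_status_list_t *l, fake_error_t *out)
{
    if (l == NULL || out == NULL)
        return ST_BADPARAM;
    if (l->count == 0)
        return ST_NO_DATA;
    *out = l->items[l->head];
    l->head = (l->head + 1) % FAKE_MAX_ERRORS;
    l->count--;
    return ST_OK;
}

/* Drains queued errors into dst until the list is empty or dst is full.
 * The capacity check comes first so no error is popped and then dropped. */
static unsigned int drain_errors(fake_status_list_t *l, fake_error_t *dst, unsigned int dstCap)
{
    unsigned int n = 0;
    fake_error_t e;
    while (n < dstCap && fake_pop_error(l, &e) == ST_OK)
        dst[n++] = e;
    return n;
}
```

Looping until the pop call stops succeeding avoids a separate count query; querying the count first is equally valid when the caller wants to size a destination buffer up front.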
-
dcgmReturn_t dcgmStatusClear(dcgmStatus_t statusHandle)
Used to clear all the errors in the status handle created by the API dcgmStatusCreate.
After one set of operations, the statusHandle can be cleared and reused for the next set of operations.
- Parameters
statusHandle – IN: Handle to list of statuses
- Returns
DCGM_ST_OK if the errors are successfully cleared
DCGM_ST_BADPARAM if statusHandle is invalid