System
- group DCGMAPI_SYS
This chapter describes the APIs used to identify entities on the node, the grouping functions that provide a mechanism to operate on a group of entities, and the status management APIs used to get individual statuses for each operation.
The APIs in the System module can be broken down into the following categories: discovery (DCGM_DISCOVERY), grouping (DCGM_GROUPING), field grouping (DCGM_FIELD_GROUPING), and status management.
The following APIs are used to discover GPUs and their attributes on a node.
Functions
-
dcgmReturn_t dcgmGetAllDevices(dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[32], int *count)
Gets the identifiers of all the devices on the system.
Each identifier is the DCGM GPU Id of a GPU on the system and is immutable during the lifespan of the engine. The list should be queried again if the engine is restarted.
The GPUs returned from this function include gpuIds of GPUs that are not supported by DCGM. To get only the gpuIds of GPUs that are supported by DCGM, use dcgmGetAllSupportedDevices().
- Parameters:
pDcgmHandle – IN: DCGM Handle
gpuIdList – OUT: Array to be filled with the GPU Ids present on the system.
count – OUT: Number of GPUs returned in gpuIdList.
- Returns:
DCGM_ST_OK if the call was successful.
DCGM_ST_BADPARAM if gpuIdList or count were not valid.
-
dcgmReturn_t dcgmGetAllSupportedDevices(dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[32], int *count)
Gets the identifiers of all the DCGM-supported devices on the system.
Each identifier is the DCGM GPU Id of a GPU on the system and is immutable during the lifespan of the engine. The list should be queried again if the engine is restarted.
This function returns ONLY the gpuIds of GPUs that are supported by DCGM. To get the gpuIds of all GPUs in the system, use dcgmGetAllDevices().
- Parameters:
pDcgmHandle – IN: DCGM Handle
gpuIdList – OUT: Array to be filled with the supported GPU Ids present on the system.
count – OUT: Number of GPUs returned in gpuIdList.
- Returns:
DCGM_ST_OK if the call was successful.
DCGM_ST_BADPARAM if gpuIdList or count were not valid.
-
dcgmReturn_t dcgmGetDeviceAttributes(dcgmHandle_t pDcgmHandle, unsigned int gpuId, dcgmDeviceAttributes_t *pDcgmAttr)
Gets device attributes corresponding to the gpuId.
If the operation is not successful for any of the requested fields, that field is populated with one of the DCGM_BLANK_VALUES defined in dcgm_structs.h.
- Parameters:
pDcgmHandle – IN: DCGM Handle
gpuId – IN: GPU Id for which the attributes should be fetched
pDcgmAttr – IN/OUT: Device attributes corresponding to gpuId. pDcgmAttr->version should be set to dcgmDeviceAttributes_version before this call.
- Returns:
DCGM_ST_OK if the call was successful.
DCGM_ST_VER_MISMATCH if pDcgmAttr->version is not set or is invalid.
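The versioned-struct convention described above can be sketched as follows. This is a minimal illustration, not DCGM code: so that it compiles without the DCGM headers, `status_t`, `attrs_t`, `ATTRS_VERSION`, the `powerLimit` field, and `stub_get_attributes` are hypothetical stand-ins for dcgmReturn_t, dcgmDeviceAttributes_t, dcgmDeviceAttributes_version, and dcgmGetDeviceAttributes. The point is only the order of operations: clear the struct, stamp the version, then call.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins; the real types live in the DCGM headers. */
typedef enum { ST_OK = 0, ST_VER_MISMATCH = -1 } status_t;
typedef struct {
    unsigned int version; /* must be stamped by the caller */
    int powerLimit;       /* illustrative field only */
} attrs_t;
#define ATTRS_VERSION 3u  /* hypothetical version value */

/* Stub that rejects an unstamped struct, mirroring the documented
 * DCGM_ST_VER_MISMATCH behaviour. */
static status_t stub_get_attributes(unsigned int gpuId, attrs_t *a)
{
    (void)gpuId;
    if (a->version != ATTRS_VERSION)
        return ST_VER_MISMATCH;
    a->powerLimit = 250; /* pretend value */
    return ST_OK;
}

/* Clear the struct, stamp the version, then query. */
status_t query_attributes(unsigned int gpuId, attrs_t *a)
{
    assert(a != NULL);
    memset(a, 0, sizeof *a);
    a->version = ATTRS_VERSION;
    return stub_get_attributes(gpuId, a);
}
```

With the real library, the same sequence would set pDcgmAttr->version to dcgmDeviceAttributes_version before calling dcgmGetDeviceAttributes.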
-
dcgmReturn_t dcgmGetEntityGroupEntities(dcgmHandle_t dcgmHandle, dcgm_field_entity_group_t entityGroup, dcgm_field_eid_t *entities, int *numEntities, unsigned int flags)
Gets the list of entities that exist for a given entity group.
This API can be used in place of dcgmGetAllDevices.
- Parameters:
dcgmHandle – IN: DCGM Handle
entityGroup – IN: Entity group to list entities of
entities – OUT: Array of entities for entityGroup
numEntities – IN/OUT: Upon calling, this should be the number of entities that entities[] can hold. Upon return, this will contain the number of entities actually saved to entities.
flags – IN: Flags to modify the behavior of this request. See DCGM_GEGE_FLAG_* #defines in dcgm_structs.h
- Returns:
DCGM_ST_OK if the call was successful.
DCGM_ST_INSUFFICIENT_SIZE if numEntities was not large enough to hold the number of entities in the entityGroup. numEntities will contain the capacity needed to complete this request successfully.
DCGM_ST_NOT_SUPPORTED if the given entityGroup does not support enumeration.
DCGM_ST_BADPARAM if any parameter is invalid
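The in/out contract of numEntities suggests a query-then-retry pattern: call with whatever buffer is at hand, and if the call reports an insufficient size, grow the buffer to the capacity it returned and call again. The sketch below illustrates that pattern only. To compile without the DCGM headers it uses hypothetical stand-ins (`status_t`, `entity_id_t`, `stub_get_entities`) in place of dcgmReturn_t, dcgm_field_eid_t, and dcgmGetEntityGroupEntities, and the five-entity node is invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins so the sketch compiles without DCGM. */
typedef enum { ST_OK = 0, ST_INSUFFICIENT_SIZE = -1 } status_t;
typedef unsigned int entity_id_t;

/* Stub with the same contract as numEntities: on entry *num is the
 * capacity; on return it is the count written, or the capacity
 * required when the buffer was too small. */
static status_t stub_get_entities(entity_id_t *entities, int *num)
{
    const int available = 5; /* pretend the node has five entities */
    if (*num < available) {
        *num = available;
        return ST_INSUFFICIENT_SIZE;
    }
    for (int i = 0; i < available; i++)
        entities[i] = (entity_id_t)i;
    *num = available;
    return ST_OK;
}

/* Query with a small buffer, then retry once with the reported size.
 * Returns a malloc'd array (caller frees) and stores the count in
 * *count, or returns NULL on failure. */
entity_id_t *list_entities(int initial_capacity, int *count)
{
    assert(initial_capacity > 0 && count != NULL);
    int n = initial_capacity;
    entity_id_t *buf = malloc((size_t)n * sizeof *buf);
    if (buf == NULL)
        return NULL;

    status_t st = stub_get_entities(buf, &n);
    if (st == ST_INSUFFICIENT_SIZE) {
        /* n now holds the capacity needed; grow and try again. */
        entity_id_t *bigger = realloc(buf, (size_t)n * sizeof *bigger);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        st = stub_get_entities(buf, &n);
    }
    if (st != ST_OK) {
        free(buf);
        return NULL;
    }
    *count = n;
    return buf;
}
```

A single retry is enough here because the stub's entity count never changes between calls; code written against a live system may prefer a loop in case entities appear between the two calls.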
-
dcgmReturn_t dcgmGetGpuInstanceHierarchy(dcgmHandle_t dcgmHandle, dcgmMigHierarchy_v2 *hierarchy)
Gets the hierarchy of GPUs, GPU Instances, and Compute Instances by populating a list of each entity with a reference to their parent.
- Parameters:
dcgmHandle – IN: DCGM Handle
hierarchy – IN/OUT: Structure to be populated with the entities in the hierarchy, each with a reference to its parent. The struct version must be set correctly before this call.
- Returns:
DCGM_ST_OK if the call was successful.
DCGM_ST_VER_MISMATCH if the struct version is incorrect
DCGM_ST_BADPARAM if any parameter is invalid
-
dcgmReturn_t dcgmGetNvLinkLinkStatus(dcgmHandle_t dcgmHandle, dcgmNvLinkStatus_v4 *linkStatus)
Get the NvLink link status for every NvLink in this system.
This includes the NvLinks of both GPUs and NvSwitches. Note that only NvSwitches and GPUs that are visible to the current environment will be returned in this structure.
- Parameters:
dcgmHandle – IN: DCGM Handle
linkStatus – OUT: Structure in which to store NvLink link statuses. .version should be set to the version value that matches dcgmNvLinkStatus_v4 (dcgmNvLinkStatus_version4) before calling this.
- Returns:
DCGM_ST_OK if the call was successful.
DCGM_ST_NOT_SUPPORTED if the given entityGroup does not support enumeration.
DCGM_ST_BADPARAM if any parameter is invalid
-
dcgmReturn_t dcgmGetCpuHierarchy(dcgmHandle_t dcgmHandle, dcgmCpuHierarchy_v1 *cpuHierarchy)
List supported CPUs and their cores present on the system.
This and other CPU APIs only support datacenter NVIDIA CPUs. Use dcgmGetCpuHierarchy_v2 to get additional CPU information.
- Parameters:
dcgmHandle – IN: DCGM Handle
cpuHierarchy – OUT: Structure where the CPUs and their associated cores will be enumerated
- Returns:
DCGM_ST_OK if the call was successful.
DCGM_ST_NOT_SUPPORTED if the device is unsupported
DCGM_ST_MODULE_NOT_LOADED if the sysmon module could not be loaded
DCGM_ST_BADPARAM if any parameter is invalid
-
dcgmReturn_t dcgmGetCpuHierarchy_v2(dcgmHandle_t dcgmHandle, dcgmCpuHierarchy_v2 *cpuHierarchy)
List supported CPUs and their cores present on the system.
This and other CPU APIs only support datacenter NVIDIA CPUs.
- Parameters:
dcgmHandle – IN: DCGM Handle
cpuHierarchy – OUT: Structure where the CPUs and their associated cores will be enumerated
- Returns:
DCGM_ST_OK if the call was successful.
DCGM_ST_NOT_SUPPORTED if the device is unsupported
DCGM_ST_MODULE_NOT_LOADED if the sysmon module could not be loaded
DCGM_ST_BADPARAM if any parameter is invalid
The following APIs are used for group management.
The user can create a group of entities and perform an operation on the whole group. If grouping is not needed and the user wishes to run commands on all GPUs or all NvSwitches seen by DCGM, DCGM_GROUP_ALL_GPUS or DCGM_GROUP_ALL_NVSWITCHES can be used in place of a group ID.
Functions
-
dcgmReturn_t dcgmGroupCreate(dcgmHandle_t pDcgmHandle, dcgmGroupType_t type, const char *groupName, dcgmGpuGrp_t *pDcgmGrpId)
Used to create an entity group handle, which can store one or more entity Ids. The group is returned as an opaque handle in pDcgmGrpId.
Instead of executing an operation separately for each entity, a DCGM group enables the user to execute the same operation on all the entities present in the group with a single API call.
To create a group containing all the entities present on the system, the type field should be specified as DCGM_GROUP_DEFAULT or DCGM_GROUP_ALL_NVSWITCHES. To create an empty group, the type field should be specified as DCGM_GROUP_EMPTY. The empty group can then be updated with the desired set of entities using dcgmGroupAddDevice, dcgmGroupAddEntity, dcgmGroupRemoveDevice, and dcgmGroupRemoveEntity.
- Parameters:
pDcgmHandle – IN: DCGM Handle
type – IN: Type of Entity Group to be formed
groupName – IN: Desired name of the group, specified as a NULL-terminated C string
pDcgmGrpId – OUT: Reference to group ID
- Returns:
DCGM_ST_OK if the group has been created
DCGM_ST_BADPARAM if any of type, groupName, or pDcgmGrpId is invalid
DCGM_ST_MAX_LIMIT if number of groups on the system has reached the max limit DCGM_MAX_NUM_GROUPS
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
-
dcgmReturn_t dcgmGroupDestroy(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId)
Used to destroy a group represented by groupId.
Since a DCGM group is a logical grouping of entities, the properties applied to the group stay intact for the individual entities even after the group is destroyed.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID
- Returns:
DCGM_ST_OK if the group has been destroyed
DCGM_ST_BADPARAM if groupId is invalid
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
DCGM_ST_NOT_CONFIGURED if no entry corresponding to the group exists
-
dcgmReturn_t dcgmGroupAddDevice(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, unsigned int gpuId)
Used to add the specified GPU Id to the group represented by groupId.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group Id to which device should be added
gpuId – IN: DCGM GPU Id
- Returns:
DCGM_ST_OK if the GPU Id has been successfully added to the group
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
DCGM_ST_NOT_CONFIGURED if no entry corresponding to the group (groupId) exists
DCGM_ST_BADPARAM if gpuId is invalid or already part of the specified group
-
dcgmReturn_t dcgmGroupAddEntity(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgm_field_entity_group_t entityGroupId, dcgm_field_eid_t entityId)
Used to add the specified entity to the group represented by groupId.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group Id to which the entity should be added
entityGroupId – IN: Entity group that entityId belongs to
entityId – IN: DCGM entityId
- Returns:
DCGM_ST_OK if the entity has been successfully added to the group
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
DCGM_ST_NOT_CONFIGURED if no entry corresponding to the group (groupId) exists
DCGM_ST_BADPARAM if entityId is invalid or already part of the specified group
-
dcgmReturn_t dcgmGroupRemoveDevice(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, unsigned int gpuId)
Used to remove the specified GPU Id from the group represented by groupId.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID from which device should be removed
gpuId – IN: DCGM GPU Id
- Returns:
DCGM_ST_OK if the GPU Id has been successfully removed from the group
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
DCGM_ST_NOT_CONFIGURED if no entry corresponding to the group (groupId) exists
DCGM_ST_BADPARAM if gpuId is invalid or not part of the specified group
-
dcgmReturn_t dcgmGroupRemoveEntity(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgm_field_entity_group_t entityGroupId, dcgm_field_eid_t entityId)
Used to remove the specified entity from the group represented by groupId.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID from which the entity should be removed
entityGroupId – IN: Entity group that entityId belongs to
entityId – IN: DCGM entityId
- Returns:
DCGM_ST_OK if the entity has been successfully removed from the group
DCGM_ST_INIT_ERROR if the library has not been successfully initialized
DCGM_ST_NOT_CONFIGURED if no entry corresponding to the group (groupId) exists
DCGM_ST_BADPARAM if entityId is invalid or not part of the specified group
-
dcgmReturn_t dcgmGroupGetInfo(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmGroupInfo_t *pDcgmGroupInfo)
Used to get information corresponding to the group represented by groupId.
The information returned in pDcgmGroupInfo consists of the group name and the list of entities present in the group.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupId – IN: Group ID for which information is to be fetched
pDcgmGroupInfo – OUT: Group Information
- Returns:
DCGM_ST_OK if the group info is successfully received.
DCGM_ST_BADPARAM if any of groupId or pDcgmGroupInfo is invalid.
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_MAX_LIMIT if the group does not contain the GPU
DCGM_ST_NOT_CONFIGURED if no entry corresponding to the group (groupId) exists
-
dcgmReturn_t dcgmGroupGetAllIds(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupIdList[], unsigned int *count)
Used to get the Ids of all groups of entities.
The group ids are returned in groupIdList, and the number of ids is returned in count. The caller must allocate groupIdList with room for DCGM_MAX_NUM_GROUPS entries.
- Parameters:
pDcgmHandle – IN: DCGM Handle
groupIdList – OUT: List of Group Ids
count – OUT: The number of Group ids in the list
- Returns:
DCGM_ST_OK if the ids of the groups were successfully retrieved
DCGM_ST_BADPARAM if either groupIdList or count is null
DCGM_ST_GENERIC_ERROR if an unknown error has occurred
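The grouping calls above are typically combined into a small lifecycle: create an empty group, add members one at a time, and destroy the group if any addition fails so that no half-built group is left behind. The sketch below shows that control flow only. To compile without the DCGM headers it replaces dcgmGroupCreate (with DCGM_GROUP_EMPTY), dcgmGroupAddDevice, and dcgmGroupDestroy with hypothetical stubs backed by a single in-memory group; every name in the block is a stand-in, not part of DCGM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins so the sketch compiles without DCGM. */
typedef enum { ST_OK = 0, ST_BADPARAM = -1 } status_t;
typedef int group_id_t;

#define STUB_MAX_MEMBERS 8
static struct {
    int live;
    unsigned int members[STUB_MAX_MEMBERS];
    int count;
} g_group; /* the stub tracks exactly one group */

static status_t stub_group_create_empty(group_id_t *id)
{
    if (id == NULL)
        return ST_BADPARAM;
    g_group.live = 1;
    g_group.count = 0;
    *id = 1;
    return ST_OK;
}

static status_t stub_group_add(group_id_t id, unsigned int gpu)
{
    if (id != 1 || !g_group.live || g_group.count == STUB_MAX_MEMBERS)
        return ST_BADPARAM;
    for (int i = 0; i < g_group.count; i++)
        if (g_group.members[i] == gpu)
            return ST_BADPARAM; /* already part of the group */
    g_group.members[g_group.count++] = gpu;
    return ST_OK;
}

static status_t stub_group_destroy(group_id_t id)
{
    if (id != 1 || !g_group.live)
        return ST_BADPARAM;
    g_group.live = 0;
    return ST_OK;
}

/* Build a group from a list of GPU ids; on any failure the partly
 * built group is destroyed before the error is returned. */
status_t make_group(const unsigned int *gpus, int n, group_id_t *out)
{
    group_id_t id;
    status_t st = stub_group_create_empty(&id);
    if (st != ST_OK)
        return st;
    for (int i = 0; i < n; i++) {
        st = stub_group_add(id, gpus[i]);
        if (st != ST_OK) {
            stub_group_destroy(id); /* clean up before reporting */
            return st;
        }
    }
    *out = id;
    return ST_OK;
}
```

The stub rejects a duplicate member with a bad-parameter status, matching the documented DCGM_ST_BADPARAM return for a gpuId that is already part of the group.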
The following APIs are used for field group management.
The user can create a group of fields and perform an operation on a group of fields at once.
Functions
-
dcgmReturn_t dcgmFieldGroupCreate(dcgmHandle_t dcgmHandle, int numFieldIds, unsigned short *fieldIds, const char *fieldGroupName, dcgmFieldGrp_t *dcgmFieldGroupId)
Used to create a group of fields and return the handle in dcgmFieldGroupId.
- Parameters:
dcgmHandle – IN: DCGM handle
numFieldIds – IN: Number of field IDs that are being provided in fieldIds[]. Must be between 1 and DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP.
fieldIds – IN: Field IDs to be added to the newly-created field group
fieldGroupName – IN: Unique name for this group of fields. This must not match the name of any existing field group.
dcgmFieldGroupId – OUT: Handle to the newly-created field group
- Returns:
DCGM_ST_OK if the field group was successfully created.
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_MAX_LIMIT if too many field groups already exist
-
dcgmReturn_t dcgmFieldGroupDestroy(dcgmHandle_t dcgmHandle, dcgmFieldGrp_t dcgmFieldGroupId)
Used to remove a field group that was created with dcgmFieldGroupCreate.
- Parameters:
dcgmHandle – IN: DCGM handle
dcgmFieldGroupId – IN: Field group to remove
- Returns:
DCGM_ST_OK if the field group was successfully removed
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
-
dcgmReturn_t dcgmFieldGroupGetInfo(dcgmHandle_t dcgmHandle, dcgmFieldGroupInfo_t *fieldGroupInfo)
Used to get information about a field group that was created with dcgmFieldGroupCreate.
- Parameters:
dcgmHandle – IN: DCGM handle
fieldGroupInfo – IN/OUT: Info about the field group. .version should be set to dcgmFieldGroupInfo_version before this call. .fieldGroupId should contain the fieldGroupId you are interested in querying information for.
- Returns:
DCGM_ST_OK if the field group info was returned successfully
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_VER_MISMATCH if .version is not set or is invalid.
-
dcgmReturn_t dcgmFieldGroupGetAll(dcgmHandle_t dcgmHandle, dcgmAllFieldGroup_t *allGroupInfo)
Used to get information about all field groups in the system.
- Parameters:
dcgmHandle – IN: DCGM handle
allGroupInfo – IN/OUT: Info about all of the field groups that exist. .version should be set to dcgmAllFieldGroup_version before this call.
- Returns:
DCGM_ST_OK if the field group info was successfully returned
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_VER_MISMATCH if .version is not set or is invalid.
The following APIs are used to manage statuses for multiple operations on one or more GPUs.
Functions
-
dcgmReturn_t dcgmStatusCreate(dcgmStatus_t *statusHandle)
Creates a reference to a DCGM status handle, which can be used to get the statuses for multiple operations on one or more devices.
The multiple statuses are useful when the operations are performed at group level. The status handle provides a mechanism to access error attributes for the failed operations.
The number of errors stored behind the opaque handle can be obtained with dcgmStatusGetCount. The errors are read from the opaque handle statusHandle with dcgmStatusPopError, which the user can invoke once per error or until all the errors are fetched.
When the status handle is no longer required, it should be destroyed with dcgmStatusDestroy.
- Parameters:
statusHandle – OUT: Reference to handle for list of statuses
- Returns:
DCGM_ST_OK if the status handle is successfully created
DCGM_ST_BADPARAM if statusHandle is invalid
-
dcgmReturn_t dcgmStatusDestroy(dcgmStatus_t statusHandle)
Used to destroy status handle created using dcgmStatusCreate.
- Parameters:
statusHandle – IN: Handle to list of statuses
- Returns:
DCGM_ST_OK if the status handle is successfully destroyed
DCGM_ST_BADPARAM if statusHandle is invalid
-
dcgmReturn_t dcgmStatusGetCount(dcgmStatus_t statusHandle, unsigned int *count)
Used to get count of error entries stored inside the opaque handle statusHandle.
- Parameters:
statusHandle – IN: Handle to list of statuses
count – OUT: Number of error entries present in the list of statuses
- Returns:
DCGM_ST_OK if the error count is successfully received
DCGM_ST_BADPARAM if any of statusHandle or count is invalid
-
dcgmReturn_t dcgmStatusPopError(dcgmStatus_t statusHandle, dcgmErrorInfo_t *pDcgmErrorInfo)
Used to iterate through the list of errors maintained behind statusHandle.
The method pops the first error from the list of DCGM statuses. In order to iterate through all the errors, the user can invoke this API for the number of errors or until all the errors are fetched.
- Parameters:
statusHandle – IN: Handle to list of statuses
pDcgmErrorInfo – OUT: First error from the list of statuses
- Returns:
DCGM_ST_OK if the error entry is successfully fetched
DCGM_ST_BADPARAM if any of statusHandle or pDcgmErrorInfo is invalid
DCGM_ST_NO_DATA if the status handle list is empty
-
dcgmReturn_t dcgmStatusClear(dcgmStatus_t statusHandle)
Used to clear all the errors in the status handle created by the API dcgmStatusCreate.
After one set of operations, the statusHandle can be cleared and reused for the next set of operations.
- Parameters:
statusHandle – IN: Handle to list of statuses
- Returns:
DCGM_ST_OK if the errors are successfully cleared
DCGM_ST_BADPARAM if statusHandle is invalid
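The "pop until empty" usage described for dcgmStatusPopError can be sketched as a loop that stops on the no-data status and treats any other non-OK status as a failure. The block below is an illustration under stated assumptions: `status_t`, `error_info_t`, `status_list_t`, and `stub_pop_error` are hypothetical stand-ins for dcgmReturn_t, dcgmErrorInfo_t, dcgmStatus_t, and dcgmStatusPopError, and the two fields of `error_info_t` are chosen for the example rather than taken from the DCGM headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins so the sketch compiles without DCGM. */
typedef enum { ST_OK = 0, ST_BADPARAM = -1, ST_NO_DATA = -2 } status_t;
typedef struct {
    unsigned int gpuId; /* illustrative fields only */
    int status;
} error_info_t;
typedef struct {
    error_info_t items[8];
    size_t head, tail; /* errors live in items[head..tail) */
} status_list_t;

/* Stub with the documented contract: pops the first error, or
 * reports "no data" once the list is empty. */
static status_t stub_pop_error(status_list_t *h, error_info_t *out)
{
    if (h == NULL || out == NULL)
        return ST_BADPARAM;
    if (h->head == h->tail)
        return ST_NO_DATA;
    *out = h->items[h->head++];
    return ST_OK;
}

/* Pops every queued error, copying at most max of them into out.
 * Returns the number popped, or -1 if popping failed for a reason
 * other than the list running empty. */
int drain_errors(status_list_t *h, error_info_t *out, int max)
{
    int n = 0;
    error_info_t e;
    status_t st;
    while ((st = stub_pop_error(h, &e)) == ST_OK) {
        if (n < max)
            out[n] = e;
        n++;
    }
    return st == ST_NO_DATA ? n : -1;
}
```

Looping on the return value avoids a separate dcgmStatusGetCount call; counting first and popping that many times is the other approach the text mentions.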
Field Grouping
- group DCGM_FIELD_GROUPING
The following APIs are used for field group management.
The user can create a group of fields and perform an operation on a group of fields at once.
Functions
-
dcgmReturn_t dcgmFieldGroupCreate(dcgmHandle_t dcgmHandle, int numFieldIds, unsigned short *fieldIds, const char *fieldGroupName, dcgmFieldGrp_t *dcgmFieldGroupId)
Used to create a group of fields and return the handle in dcgmFieldGroupId.
- Parameters:
dcgmHandle – IN: DCGM handle
numFieldIds – IN: Number of field IDs that are being provided in fieldIds[]. Must be between 1 and DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP.
fieldIds – IN: Field IDs to be added to the newly-created field group
fieldGroupName – IN: Unique name for this group of fields. This must not match the name of any existing field group.
dcgmFieldGroupId – OUT: Handle to the newly-created field group
- Returns:
DCGM_ST_OK if the field group was successfully created.
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_MAX_LIMIT if too many field groups already exist
-
dcgmReturn_t dcgmFieldGroupDestroy(dcgmHandle_t dcgmHandle, dcgmFieldGrp_t dcgmFieldGroupId)
Used to remove a field group that was created with dcgmFieldGroupCreate.
- Parameters:
dcgmHandle – IN: DCGM handle
dcgmFieldGroupId – IN: Field group to remove
- Returns:
DCGM_ST_OK if the field group was successfully removed
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
-
dcgmReturn_t dcgmFieldGroupGetInfo(dcgmHandle_t dcgmHandle, dcgmFieldGroupInfo_t *fieldGroupInfo)
Used to get information about a field group that was created with dcgmFieldGroupCreate.
- Parameters:
dcgmHandle – IN: DCGM handle
fieldGroupInfo – IN/OUT: Info about the field group being queried.
.version should be set to dcgmFieldGroupInfo_version before this call.
.fieldGroupId should contain the fieldGroupId you are interested in querying information for.
- Returns:
DCGM_ST_OK if the field group info was returned successfully
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_VER_MISMATCH if .version is not set or is invalid.
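The three calls above (create, query, destroy) can be chained into a round trip, sketched below. The point of interest is the IN/OUT struct: .version and .fieldGroupId are filled in before dcgmFieldGroupGetInfo is called. So that the block compiles without DCGM installed, it declares stand-ins for the types, constants, and the three functions. The struct layout, constant values, name-buffer size, and one-slot fake registry are all assumptions. The helper field_group_round_trip and the field ids 150 and 155 are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* ---- Stand-ins so the sketch compiles without DCGM installed. ----
 * With the real library, delete this section and include the DCGM headers.
 * All values and the struct layout below are placeholders. */
#define DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP 128
#define dcgmFieldGroupInfo_version 1u
typedef enum { DCGM_ST_OK = 0, DCGM_ST_BADPARAM = -1,
               DCGM_ST_VER_MISMATCH = -7 } dcgmReturn_t;
typedef uintptr_t dcgmHandle_t;
typedef uintptr_t dcgmFieldGrp_t;
typedef struct {
    unsigned int version;
    unsigned int numFieldIds;
    dcgmFieldGrp_t fieldGroupId;
    char fieldGroupName[64];
    unsigned short fieldIds[DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP];
} dcgmFieldGroupInfo_t;

static dcgmFieldGroupInfo_t g_fake;   /* one-slot fake registry */

static dcgmReturn_t dcgmFieldGroupCreate(dcgmHandle_t h, int n,
        unsigned short *ids, const char *name, dcgmFieldGrp_t *out)
{
    (void)h;
    if (n < 1 || n > DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP || !ids || !name || !out)
        return DCGM_ST_BADPARAM;
    g_fake.numFieldIds = (unsigned int)n;
    g_fake.fieldGroupId = 42;
    memcpy(g_fake.fieldIds, ids, (size_t)n * sizeof ids[0]);
    snprintf(g_fake.fieldGroupName, sizeof g_fake.fieldGroupName, "%s", name);
    *out = g_fake.fieldGroupId;
    return DCGM_ST_OK;
}

static dcgmReturn_t dcgmFieldGroupGetInfo(dcgmHandle_t h, dcgmFieldGroupInfo_t *info)
{
    (void)h;
    if (!info) return DCGM_ST_BADPARAM;
    if (info->version != dcgmFieldGroupInfo_version) return DCGM_ST_VER_MISMATCH;
    if (info->fieldGroupId != g_fake.fieldGroupId) return DCGM_ST_BADPARAM;
    *info = g_fake;
    info->version = dcgmFieldGroupInfo_version;
    return DCGM_ST_OK;
}

static dcgmReturn_t dcgmFieldGroupDestroy(dcgmHandle_t h, dcgmFieldGrp_t id)
{
    (void)h;
    if (g_fake.numFieldIds == 0 || id != g_fake.fieldGroupId)
        return DCGM_ST_BADPARAM;
    memset(&g_fake, 0, sizeof g_fake);
    return DCGM_ST_OK;
}
/* ---- End of stand-ins. ---- */

/* Hypothetical helper: creates a field group, reads it back, destroys it.
 * Returns the number of fields reported by GetInfo, or -1 on failure. */
int field_group_round_trip(dcgmHandle_t handle)
{
    unsigned short fieldIds[] = { 150, 155 };   /* arbitrary example ids */
    dcgmFieldGrp_t fgId = 0;
    if (dcgmFieldGroupCreate(handle, 2, fieldIds, "example_fields", &fgId) != DCGM_ST_OK)
        return -1;

    dcgmFieldGroupInfo_t info;
    memset(&info, 0, sizeof info);
    info.version = dcgmFieldGroupInfo_version;  /* IN: must be set before the call */
    info.fieldGroupId = fgId;                   /* IN: which group to query */

    int result = -1;
    dcgmReturn_t rc = dcgmFieldGroupGetInfo(handle, &info);
    if (rc == DCGM_ST_OK) {
        assert(info.fieldGroupId == fgId);
        printf("%s has %u fields\n", info.fieldGroupName, info.numFieldIds);
        result = (int)info.numFieldIds;
    } else {
        fprintf(stderr, "dcgmFieldGroupGetInfo failed: %d\n", (int)rc);
    }

    /* Destroy the group on both the success and failure paths. */
    if (dcgmFieldGroupDestroy(handle, fgId) != DCGM_ST_OK)
        return -1;
    return result;
}
```

Zeroing the struct before setting the two input members keeps the remaining fields in a known state if the call fails. Leaving .version unset corresponds to the DCGM_ST_VER_MISMATCH case in the return list.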
-
dcgmReturn_t dcgmFieldGroupGetAll(dcgmHandle_t dcgmHandle, dcgmAllFieldGroup_t *allGroupInfo)
Used to get information about all field groups in the system.
- Parameters:
dcgmHandle – IN: DCGM handle
allGroupInfo – IN/OUT: Info about all of the field groups that exist.
.version should be set to dcgmAllFieldGroup_version before this call.
- Returns:
DCGM_ST_OK if the field group info was successfully returned
DCGM_ST_BADPARAM if any parameters were bad
DCGM_ST_INIT_ERROR if the library has not been successfully initialized.
DCGM_ST_VER_MISMATCH if .version is not set or is invalid.
Status Handling
- group DCGMAPI_ST
The following APIs are used to manage statuses for multiple operations on one or more GPUs.
Functions
-
dcgmReturn_t dcgmStatusCreate(dcgmStatus_t *statusHandle)
Creates a reference to a DCGM status handler, which can be used to get the statuses for multiple operations on one or more devices.
Multiple statuses are useful when operations are performed at the group level. The status handle provides a mechanism to access error attributes for the failed operations.
The number of errors stored behind the opaque handle can be accessed using the API dcgmStatusGetCount. The errors are accessed from the opaque handle statusHandle using the API dcgmStatusPopError. The user can invoke dcgmStatusPopError once per error, or repeatedly until all the errors are fetched.
When the status handle is no longer required, it should be deleted using the API dcgmStatusDestroy.
- Parameters:
statusHandle – OUT: Reference to handle for list of statuses
- Returns:
DCGM_ST_OK if the status handle is successfully created
DCGM_ST_BADPARAM if statusHandle is invalid
-
dcgmReturn_t dcgmStatusDestroy(dcgmStatus_t statusHandle)
Used to destroy a status handle created using dcgmStatusCreate.
- Parameters:
statusHandle – IN: Handle to list of statuses
- Returns:
DCGM_ST_OK if the status handle is successfully destroyed
DCGM_ST_BADPARAM if statusHandle is invalid
-
dcgmReturn_t dcgmStatusGetCount(dcgmStatus_t statusHandle, unsigned int *count)
Used to get the count of error entries stored inside the opaque handle statusHandle.
- Parameters:
statusHandle – IN: Handle to list of statuses
count – OUT: Number of error entries present in the list of statuses
- Returns:
DCGM_ST_OK if the error count is successfully received
DCGM_ST_BADPARAM if any of statusHandle or count is invalid
-
dcgmReturn_t dcgmStatusPopError(dcgmStatus_t statusHandle, dcgmErrorInfo_t *pDcgmErrorInfo)
Used to iterate through the list of errors maintained behind statusHandle.
The method pops the first error from the list of DCGM statuses. To iterate through all the errors, the user can invoke this API once per error reported by dcgmStatusGetCount, or repeatedly until DCGM_ST_NO_DATA is returned.
- Parameters:
statusHandle – IN: Handle to list of statuses
pDcgmErrorInfo – OUT: First error from the list of statuses
- Returns:
DCGM_ST_OK if the error entry is successfully fetched
DCGM_ST_BADPARAM if any of statusHandle or pDcgmErrorInfo is invalid
DCGM_ST_NO_DATA if the status handle list is empty
-
dcgmReturn_t dcgmStatusClear(dcgmStatus_t statusHandle)
Used to clear all the errors in the status handle created by the API dcgmStatusCreate.
After one set of operations, the statusHandle can be cleared and reused for the next set of operations.
- Parameters:
statusHandle – IN: Handle to list of statuses
- Returns:
DCGM_ST_OK if the errors are successfully cleared
DCGM_ST_BADPARAM if statusHandle is invalid
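The status handle lifecycle described in this group (create, read the count, pop errors until the list is empty, clear, destroy) can be sketched as follows. So that the block compiles without DCGM installed, it declares stand-ins for the types and the five status functions. The enum values, the struct layout of dcgmErrorInfo_t, and the pointer-based handle are assumptions, since the real handle is opaque. fake_group_operation and run_and_report are hypothetical helpers; the former stands in for any group-level call that records per-GPU failures.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* ---- Stand-ins so the sketch compiles without DCGM installed. ----
 * With the real library, delete this section and include the DCGM headers.
 * Enum values, struct layout, and the handle representation are placeholders. */
typedef enum { DCGM_ST_OK = 0, DCGM_ST_BADPARAM = -1,
               DCGM_ST_NO_DATA = -14 } dcgmReturn_t;
typedef struct { unsigned int gpuId; short fieldId; int status; } dcgmErrorInfo_t;
typedef struct { dcgmErrorInfo_t errs[8]; unsigned int n; } fakeStatusList;
typedef fakeStatusList *dcgmStatus_t;

static dcgmReturn_t dcgmStatusCreate(dcgmStatus_t *h)
{
    if (!h) return DCGM_ST_BADPARAM;
    *h = calloc(1, sizeof **h);
    return *h ? DCGM_ST_OK : DCGM_ST_BADPARAM;
}
static dcgmReturn_t dcgmStatusDestroy(dcgmStatus_t h)
{
    if (!h) return DCGM_ST_BADPARAM;
    free(h);
    return DCGM_ST_OK;
}
static dcgmReturn_t dcgmStatusGetCount(dcgmStatus_t h, unsigned int *count)
{
    if (!h || !count) return DCGM_ST_BADPARAM;
    *count = h->n;
    return DCGM_ST_OK;
}
static dcgmReturn_t dcgmStatusPopError(dcgmStatus_t h, dcgmErrorInfo_t *e)
{
    if (!h || !e) return DCGM_ST_BADPARAM;
    if (h->n == 0) return DCGM_ST_NO_DATA;
    *e = h->errs[0];                       /* hand back the first entry */
    h->n--;
    memmove(&h->errs[0], &h->errs[1], h->n * sizeof h->errs[0]);
    return DCGM_ST_OK;
}
static dcgmReturn_t dcgmStatusClear(dcgmStatus_t h)
{
    if (!h) return DCGM_ST_BADPARAM;
    h->n = 0;
    return DCGM_ST_OK;
}
/* Hypothetical group-level operation that fails on two GPUs. */
static void fake_group_operation(dcgmStatus_t h)
{
    h->errs[0] = (dcgmErrorInfo_t){ 0, 150, -1 };
    h->errs[1] = (dcgmErrorInfo_t){ 3, 155, -1 };
    h->n = 2;
}
/* ---- End of stand-ins. ---- */

/* Runs one operation and drains its errors.
 * Returns how many errors were popped, or -1 if the handle could not be created. */
int run_and_report(void)
{
    dcgmStatus_t status;
    if (dcgmStatusCreate(&status) != DCGM_ST_OK)
        return -1;

    fake_group_operation(status);

    unsigned int expected = 0;
    dcgmStatusGetCount(status, &expected);

    int popped = 0;
    dcgmErrorInfo_t err;
    /* Pop until the list is empty (DCGM_ST_NO_DATA ends the loop). */
    while (dcgmStatusPopError(status, &err) == DCGM_ST_OK) {
        printf("gpu %u, field %d: status %d\n",
               err.gpuId, (int)err.fieldId, err.status);
        popped++;
    }
    assert((unsigned int)popped == expected);

    dcgmStatusClear(status);     /* handle could now be reused for another operation */
    dcgmStatusDestroy(status);
    return popped;
}
```

Looping on the return value of dcgmStatusPopError rather than on a precomputed count keeps the drain loop correct even if the count is not read first; the count is used here only as a cross-check.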