System

group DCGMAPI_SYS

This chapter describes the APIs used to identify entities on the node, the grouping functions that provide a mechanism to operate on a group of entities, and the status management APIs used to get individual statuses for each operation.

The APIs in the System module can be broken down into the following categories: Discovery, Grouping, Field Grouping, and Status Handling.

Discovery

group DCGM_DISCOVERY

The following APIs are used to discover GPUs and their attributes on a Node.

Functions

dcgmReturn_t dcgmGetAllDevices(dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[32], int *count)

This method is used to get identifiers corresponding to all the devices on the system.

Each identifier is the DCGM GPU Id of a GPU on the system and is immutable during the lifespan of the engine. The list should be queried again if the engine is restarted.

The GPUs returned from this function include gpuIds of GPUs that are not supported by DCGM. To get only the gpuIds of GPUs that are supported by DCGM, use dcgmGetAllSupportedDevices().

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • gpuIdList – OUT: Array reference to fill GPU Ids present on the system.

  • count – OUT: Number of GPUs returned in gpuIdList.

Returns:

dcgmReturn_t dcgmGetAllSupportedDevices(dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[32], int *count)

This method is used to get identifiers corresponding to all the DCGM-supported devices on the system.

Each identifier is the DCGM GPU Id of a GPU on the system and is immutable during the lifespan of the engine. The list should be queried again if the engine is restarted.

The GPUs returned from this function include only the gpuIds of GPUs that are supported by DCGM. To get the gpuIds of all GPUs in the system, use dcgmGetAllDevices().
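
The difference between the two discovery calls can be sketched as follows. This is a minimal illustration, not a definitive usage pattern: the section marked as stand-ins is a fake engine written only so the sketch compiles without DCGM installed, and report_unsupported_gpus is a hypothetical helper name. In real code the stand-ins would be removed in favor of the DCGM headers (presumably dcgm_agent.h alongside the dcgm_structs.h mentioned in this chapter) and a handle obtained from the engine.

```c
#include <stdint.h>
#include <stdio.h>

/* ---- Stand-ins so the sketch compiles without DCGM installed. ----
 * In real code, delete this section and include the DCGM headers instead. */
typedef enum { DCGM_ST_OK = 0 } dcgmReturn_t;
typedef uintptr_t dcgmHandle_t;

/* Fake node: GPUs 0, 1 and 2 exist; GPU 1 is not supported by DCGM. */
static dcgmReturn_t dcgmGetAllDevices(dcgmHandle_t h, unsigned int gpuIdList[32], int *count)
{
    (void)h;
    gpuIdList[0] = 0;
    gpuIdList[1] = 1;
    gpuIdList[2] = 2;
    *count = 3;
    return DCGM_ST_OK;
}

static dcgmReturn_t dcgmGetAllSupportedDevices(dcgmHandle_t h, unsigned int gpuIdList[32], int *count)
{
    (void)h;
    gpuIdList[0] = 0;
    gpuIdList[1] = 2;
    *count = 2;
    return DCGM_ST_OK;
}
/* ---- End of stand-ins. ---- */

/* Prints every GPU id and whether DCGM supports it.
 * Returns the number of unsupported GPUs, or -1 if either query fails. */
int report_unsupported_gpus(dcgmHandle_t handle)
{
    unsigned int all[32];       /* sized to match gpuIdList[32] in the signature */
    unsigned int supported[32];
    int numAll = 0;
    int numSupported = 0;
    int unsupported = 0;

    if (dcgmGetAllDevices(handle, all, &numAll) != DCGM_ST_OK)
        return -1;
    if (dcgmGetAllSupportedDevices(handle, supported, &numSupported) != DCGM_ST_OK)
        return -1;

    for (int i = 0; i < numAll; i++) {
        int isSupported = 0;
        for (int j = 0; j < numSupported; j++) {
            if (supported[j] == all[i])
                isSupported = 1;
        }
        printf("GPU %u: %s\n", all[i], isSupported ? "supported" : "not supported");
        if (!isSupported)
            unsupported++;
    }
    return unsupported;
}
```

Because the ids are only stable for the lifespan of the engine, a caller that caches either list would need to repeat both queries after an engine restart.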

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • gpuIdList – OUT: Array reference to fill GPU Ids present on the system.

  • count – OUT: Number of GPUs returned in gpuIdList.

Returns:

dcgmReturn_t dcgmGetDeviceAttributes(dcgmHandle_t pDcgmHandle, unsigned int gpuId, dcgmDeviceAttributes_t *pDcgmAttr)

Gets device attributes corresponding to the gpuId.

If the operation is not successful for any of the requested fields, that field is populated with one of the DCGM_BLANK_VALUES defined in dcgm_structs.h.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • gpuId – IN: GPU Id corresponding to which the attributes should be fetched

  • pDcgmAttr – IN/OUT: Device attributes corresponding to gpuId. pDcgmAttr->version should be set to dcgmDeviceAttributes_version before this call.
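
The "set the version, then call" requirement can be sketched as below. Everything in the stand-in section is an assumption made so the example compiles on its own: the structure layout, the stubPowerLimit field, the STUB_ST_ERROR code, and the numeric value of dcgmDeviceAttributes_version are placeholders, not the real definitions. Only the calling pattern is meant to carry over.

```c
#include <stdint.h>
#include <string.h>

/* ---- Stand-ins so the sketch compiles without DCGM installed. ---- */
typedef enum {
    DCGM_ST_OK    = 0,
    STUB_ST_ERROR = -1 /* hypothetical error code used only by this stub */
} dcgmReturn_t;
typedef uintptr_t dcgmHandle_t;

#define dcgmDeviceAttributes_version 1u /* placeholder; the real header defines this */

typedef struct {
    unsigned int version;
    int stubPowerLimit; /* hypothetical field, not the real layout */
} dcgmDeviceAttributes_t;

/* Fake engine: rejects a structure whose version was not set by the caller. */
static dcgmReturn_t dcgmGetDeviceAttributes(dcgmHandle_t h, unsigned int gpuId,
                                            dcgmDeviceAttributes_t *pDcgmAttr)
{
    (void)h;
    (void)gpuId;
    if (pDcgmAttr->version != dcgmDeviceAttributes_version)
        return STUB_ST_ERROR;
    pDcgmAttr->stubPowerLimit = 250;
    return DCGM_ST_OK;
}
/* ---- End of stand-ins. ---- */

/* Zeroes the structure, stamps the version, then queries the attributes.
 * Returns 0 on success and -1 on failure. */
int fetch_attributes(dcgmHandle_t handle, unsigned int gpuId, dcgmDeviceAttributes_t *attrs)
{
    memset(attrs, 0, sizeof(*attrs));
    attrs->version = dcgmDeviceAttributes_version;
    return dcgmGetDeviceAttributes(handle, gpuId, attrs) == DCGM_ST_OK ? 0 : -1;
}
```

Zeroing the structure first is a design choice of this sketch rather than a documented requirement; it simply avoids passing uninitialized memory to the library.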

Returns:

dcgmReturn_t dcgmGetEntityGroupEntities(dcgmHandle_t dcgmHandle, dcgm_field_entity_group_t entityGroup, dcgm_field_eid_t *entities, int *numEntities, unsigned int flags)

Gets the list of entities that exist for a given entity group.

This API can be used in place of dcgmGetAllDevices.

Parameters:
  • dcgmHandle – IN: DCGM Handle

  • entityGroup – IN: Entity group to list entities of

  • entities – OUT: Array of entities for entityGroup

  • numEntities – IN/OUT: Upon calling, this should be the number of entities that entities[] can hold. Upon return, this will contain the number of entities actually saved to entities[].

  • flags – IN: Flags to modify the behavior of this request. See DCGM_GEGE_FLAG_* #defines in dcgm_structs.h
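
The IN/OUT use of numEntities can be sketched as follows. The stand-in section is a fake engine that pretends the node has two GPU entities and fails when the caller's array is too small; that failure behavior, the STUB_ST_ERROR code, and the list_gpu_entities helper are assumptions for illustration. The entity-group enumerator values are as recalled from the DCGM headers, which remain authoritative.

```c
#include <stdint.h>

/* ---- Stand-ins so the sketch compiles without DCGM installed. ---- */
typedef enum {
    DCGM_ST_OK    = 0,
    STUB_ST_ERROR = -1 /* hypothetical error code used only by this stub */
} dcgmReturn_t;
typedef uintptr_t dcgmHandle_t;
typedef unsigned int dcgm_field_eid_t;
typedef enum { DCGM_FE_NONE = 0, DCGM_FE_GPU = 1 } dcgm_field_entity_group_t;

/* Fake engine: two GPU entities (ids 0 and 1); fails if the array is too small. */
static dcgmReturn_t dcgmGetEntityGroupEntities(dcgmHandle_t h,
                                               dcgm_field_entity_group_t entityGroup,
                                               dcgm_field_eid_t *entities,
                                               int *numEntities,
                                               unsigned int flags)
{
    (void)h;
    (void)flags;
    if (entityGroup != DCGM_FE_GPU || *numEntities < 2)
        return STUB_ST_ERROR;
    entities[0] = 0;
    entities[1] = 1;
    *numEntities = 2;
    return DCGM_ST_OK;
}
/* ---- End of stand-ins. ---- */

/* Lists GPU entities into out[], which can hold `capacity` ids.
 * Returns the number of ids written, or -1 on failure. */
int list_gpu_entities(dcgmHandle_t handle, dcgm_field_eid_t *out, int capacity)
{
    int numEntities = capacity; /* IN: how many ids out[] can hold */

    if (dcgmGetEntityGroupEntities(handle, DCGM_FE_GPU, out, &numEntities, 0) != DCGM_ST_OK)
        return -1;

    return numEntities;         /* OUT: how many ids were actually written */
}
```

Passing 0 for flags requests the default behavior in this sketch; the DCGM_GEGE_FLAG_* values in dcgm_structs.h describe the alternatives.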

Returns:

dcgmReturn_t dcgmGetGpuInstanceHierarchy(dcgmHandle_t dcgmHandle, dcgmMigHierarchy_v2 *hierarchy)

Gets the hierarchy of GPUs, GPU Instances, and Compute Instances by populating a list of each entity with a reference to their parent.

Parameters:
  • dcgmHandle – IN: DCGM Handle

  • hierarchy – OUT: Structure to populate with the entities in the hierarchy, each with a reference to its parent.

Returns:

dcgmReturn_t dcgmGetNvLinkLinkStatus(dcgmHandle_t dcgmHandle, dcgmNvLinkStatus_v4 *linkStatus)

Get the NvLink link status for every NvLink in this system.

This includes the NvLinks of both GPUs and NvSwitches. Note that only NvSwitches and GPUs that are visible to the current environment will be returned in this structure.

Parameters:
  • dcgmHandle – IN: DCGM Handle

  • linkStatus – OUT: Structure in which to store NvLink link statuses. .version should be set to the version constant that matches the structure type (dcgmNvLinkStatus_v4 in this signature) before calling this.

Returns:

dcgmReturn_t dcgmGetCpuHierarchy(dcgmHandle_t dcgmHandle, dcgmCpuHierarchy_v1 *cpuHierarchy)

List supported CPUs and their cores present on the system.

This and other CPU APIs only support datacenter NVIDIA CPUs. Use dcgmGetCpuHierarchy_v2 to get additional CPU information.

Parameters:
  • dcgmHandle – IN: DCGM Handle

  • cpuHierarchy – OUT: Structure where the CPUs and their associated cores will be enumerated

Returns:

dcgmReturn_t dcgmGetCpuHierarchy_v2(dcgmHandle_t dcgmHandle, dcgmCpuHierarchy_v2 *cpuHierarchy)

List supported CPUs and their cores present on the system.

This and other CPU APIs only support datacenter NVIDIA CPUs.

Parameters:
  • dcgmHandle – IN: DCGM Handle

  • cpuHierarchy – OUT: Structure where the CPUs and their associated cores will be enumerated

Returns:

Grouping

group DCGM_GROUPING

The following APIs are used for group management.

The user can create a group of entities and perform an operation on the whole group. If grouping is not needed and the user wishes to run commands on all GPUs or all NvSwitches seen by DCGM, DCGM_GROUP_ALL_GPUS or DCGM_GROUP_ALL_NVSWITCHES can be used in place of a group ID.

Functions

dcgmReturn_t dcgmGroupCreate(dcgmHandle_t pDcgmHandle, dcgmGroupType_t type, const char *groupName, dcgmGpuGrp_t *pDcgmGrpId)

Used to create an entity group handle, which can store one or more entity Ids, as an opaque handle returned in pDcgmGrpId.

Instead of executing an operation separately for each entity, the DCGM group enables the user to execute the same operation on all the entities present in the group in a single API call.

To create the group with all the entities present on the system, the type field should be specified as DCGM_GROUP_DEFAULT or DCGM_GROUP_ALL_NVSWITCHES. To create an empty group, the type field should be specified as DCGM_GROUP_EMPTY. The empty group can be updated with the desired set of entities using the APIs dcgmGroupAddDevice, dcgmGroupAddEntity, dcgmGroupRemoveDevice, and dcgmGroupRemoveEntity.
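
The empty-group workflow described above can be sketched as follows. The stand-in section is a fake engine that tracks a single group so the example compiles without DCGM; its bookkeeping variables, the STUB_ST_ERROR code, the group name, and the build_group helper are all hypothetical. The enumerator values for dcgmGroupType_t are as recalled from the DCGM headers, which remain authoritative.

```c
#include <stdint.h>

/* ---- Stand-ins so the sketch compiles without DCGM installed. ---- */
typedef enum {
    DCGM_ST_OK    = 0,
    STUB_ST_ERROR = -1 /* hypothetical error code used only by this stub */
} dcgmReturn_t;
typedef uintptr_t dcgmHandle_t;
typedef uintptr_t dcgmGpuGrp_t;
typedef enum { DCGM_GROUP_DEFAULT = 0, DCGM_GROUP_EMPTY = 1 } dcgmGroupType_t;

static int stubGroupExists; /* fake engine state: one group at most */
static int stubMemberCount;

static dcgmReturn_t dcgmGroupCreate(dcgmHandle_t h, dcgmGroupType_t type,
                                    const char *groupName, dcgmGpuGrp_t *pDcgmGrpId)
{
    (void)h;
    (void)groupName;
    if (type != DCGM_GROUP_EMPTY) /* this stub only models empty groups */
        return STUB_ST_ERROR;
    stubGroupExists = 1;
    stubMemberCount = 0;
    *pDcgmGrpId     = 1;
    return DCGM_ST_OK;
}

static dcgmReturn_t dcgmGroupAddDevice(dcgmHandle_t h, dcgmGpuGrp_t groupId, unsigned int gpuId)
{
    (void)h;
    (void)gpuId;
    if (!stubGroupExists || groupId != 1)
        return STUB_ST_ERROR;
    stubMemberCount++;
    return DCGM_ST_OK;
}

static dcgmReturn_t dcgmGroupDestroy(dcgmHandle_t h, dcgmGpuGrp_t groupId)
{
    (void)h;
    if (!stubGroupExists || groupId != 1)
        return STUB_ST_ERROR;
    stubGroupExists = 0;
    return DCGM_ST_OK;
}
/* ---- End of stand-ins. ---- */

/* Creates an empty group and adds the given GPU ids to it.
 * On failure the partially built group is destroyed and -1 is returned. */
int build_group(dcgmHandle_t handle, const unsigned int *gpuIds, int numGpus, dcgmGpuGrp_t *groupOut)
{
    if (dcgmGroupCreate(handle, DCGM_GROUP_EMPTY, "example_group", groupOut) != DCGM_ST_OK)
        return -1;

    for (int i = 0; i < numGpus; i++) {
        if (dcgmGroupAddDevice(handle, *groupOut, gpuIds[i]) != DCGM_ST_OK) {
            dcgmGroupDestroy(handle, *groupOut);
            return -1;
        }
    }
    return 0;
}
```

The caller owns the returned group and would pass it to dcgmGroupDestroy once it is no longer needed.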

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • type – IN: Type of Entity Group to be formed

  • groupName – IN: Desired name of the GPU group, specified as a null-terminated C string

  • pDcgmGrpId – OUT: Reference to group ID

Returns:

dcgmReturn_t dcgmGroupDestroy(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId)

Used to destroy a group represented by groupId.

Since a DCGM group is a logical grouping of entities, the properties applied to the group stay intact for the individual entities even after the group is destroyed.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID

Returns:

dcgmReturn_t dcgmGroupAddDevice(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, unsigned int gpuId)

Used to add the specified GPU Id to the group represented by groupId.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group Id to which device should be added

  • gpuId – IN: DCGM GPU Id

Returns:

dcgmReturn_t dcgmGroupAddEntity(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgm_field_entity_group_t entityGroupId, dcgm_field_eid_t entityId)

Used to add the specified entity to the group represented by groupId.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group Id to which the entity should be added

  • entityGroupId – IN: Entity group that entityId belongs to

  • entityId – IN: DCGM entityId

Returns:

dcgmReturn_t dcgmGroupRemoveDevice(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, unsigned int gpuId)

Used to remove the specified GPU Id from the group represented by groupId.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID from which device should be removed

  • gpuId – IN: DCGM GPU Id

Returns:

dcgmReturn_t dcgmGroupRemoveEntity(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgm_field_entity_group_t entityGroupId, dcgm_field_eid_t entityId)

Used to remove the specified entity from the group represented by groupId.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID from which the entity should be removed

  • entityGroupId – IN: Entity group that entityId belongs to

  • entityId – IN: DCGM entityId

Returns:

dcgmReturn_t dcgmGroupGetInfo(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, dcgmGroupInfo_t *pDcgmGroupInfo)

Used to get information corresponding to the group represented by groupId.

The information returned in pDcgmGroupInfo consists of the group name and the list of entities present in the group.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID for which information is to be fetched

  • pDcgmGroupInfo – OUT: Group Information

Returns:

dcgmReturn_t dcgmGroupGetAllIds(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupIdList[], unsigned int *count)

Used to get the Ids of all groups of entities.

The information returned is a list of group ids in groupIdList and the number of ids in count. The caller must allocate groupIdList with room for MAX_NUM_GROUPS group ids.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupIdList – OUT: List of Group Ids

  • count – OUT: The number of Group ids in the list

Returns:

Field Grouping

group DCGM_FIELD_GROUPING

The following APIs are used for field group management.

The user can create a group of fields and perform an operation on a group of fields at once.

Functions

dcgmReturn_t dcgmFieldGroupCreate(dcgmHandle_t dcgmHandle, int numFieldIds, unsigned short *fieldIds, const char *fieldGroupName, dcgmFieldGrp_t *dcgmFieldGroupId)

Used to create a group of fields and return the handle in dcgmFieldGroupId.

Parameters:
  • dcgmHandle – IN: DCGM handle

  • numFieldIds – IN: Number of field IDs that are being provided in fieldIds[]. Must be between 1 and DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP.

  • fieldIds – IN: Field IDs to be added to the newly-created field group

  • fieldGroupName – IN: Unique name for this group of fields. This must not be the same as the name of any existing field group.

  • dcgmFieldGroupId – OUT: Handle to the newly-created field group
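
A field group creation call can be sketched as below. The stand-in section is a fake engine so the example compiles without DCGM; the numeric values given for the two field identifiers and for DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP are as recalled from the DCGM headers and should be treated as placeholders, while STUB_ST_ERROR, stubLastNumFields, the group name, and create_temp_power_group are hypothetical.

```c
#include <stdint.h>

/* ---- Stand-ins so the sketch compiles without DCGM installed. ---- */
typedef enum {
    DCGM_ST_OK    = 0,
    STUB_ST_ERROR = -1 /* hypothetical error code used only by this stub */
} dcgmReturn_t;
typedef uintptr_t dcgmHandle_t;
typedef uintptr_t dcgmFieldGrp_t;

#define DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP 128 /* placeholder value */
#define DCGM_FI_DEV_GPU_TEMP               150 /* placeholder value */
#define DCGM_FI_DEV_POWER_USAGE            155 /* placeholder value */

static int stubLastNumFields; /* fake engine state: size of the last group created */

/* Fake engine: enforces the documented 1..MAX bound on numFieldIds. */
static dcgmReturn_t dcgmFieldGroupCreate(dcgmHandle_t h, int numFieldIds,
                                         unsigned short *fieldIds,
                                         const char *fieldGroupName,
                                         dcgmFieldGrp_t *dcgmFieldGroupId)
{
    (void)h;
    (void)fieldIds;
    (void)fieldGroupName;
    if (numFieldIds < 1 || numFieldIds > DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP)
        return STUB_ST_ERROR;
    stubLastNumFields = numFieldIds;
    *dcgmFieldGroupId = 1;
    return DCGM_ST_OK;
}
/* ---- End of stand-ins. ---- */

/* Creates a field group holding GPU temperature and power usage.
 * Returns 0 on success and -1 on failure. */
int create_temp_power_group(dcgmHandle_t handle, dcgmFieldGrp_t *groupOut)
{
    unsigned short fieldIds[] = { DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_POWER_USAGE };
    int numFieldIds           = (int)(sizeof(fieldIds) / sizeof(fieldIds[0]));

    if (dcgmFieldGroupCreate(handle, numFieldIds, fieldIds, "temp_and_power", groupOut) != DCGM_ST_OK)
        return -1;
    return 0;
}
```

Deriving numFieldIds from the array with sizeof keeps the count and the list in step when fields are added or removed.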

Returns:

dcgmReturn_t dcgmFieldGroupDestroy(dcgmHandle_t dcgmHandle, dcgmFieldGrp_t dcgmFieldGroupId)

Used to remove a field group that was created with dcgmFieldGroupCreate.

Parameters:
  • dcgmHandle – IN: DCGM handle

  • dcgmFieldGroupId – IN: Field group to remove

Returns:

dcgmReturn_t dcgmFieldGroupGetInfo(dcgmHandle_t dcgmHandle, dcgmFieldGroupInfo_t *fieldGroupInfo)

Used to get information about a field group that was created with dcgmFieldGroupCreate.

Parameters:
  • dcgmHandle – IN: DCGM handle

  • fieldGroupInfo – IN/OUT: Info about the field group. .version should be set to dcgmFieldGroupInfo_version before this call. .fieldGroupId should contain the fieldGroupId you are interested in querying information for.

Returns:

dcgmReturn_t dcgmFieldGroupGetAll(dcgmHandle_t dcgmHandle, dcgmAllFieldGroup_t *allGroupInfo)

Used to get information about all field groups in the system.

Parameters:
  • dcgmHandle – IN: DCGM handle

  • allGroupInfo – IN/OUT: Info about all of the field groups that exist. .version should be set to dcgmAllFieldGroup_version before this call.

Returns:

Status Handling

group DCGMAPI_ST

The following APIs are used to manage statuses for multiple operations on one or more GPUs.

Functions

dcgmReturn_t dcgmStatusCreate(dcgmStatus_t *statusHandle)

Creates a reference to a DCGM status handle, which can be used to get the statuses for multiple operations on one or more devices.

The multiple statuses are useful when the operations are performed at group level. The status handle provides a mechanism to access error attributes for the failed operations.

The number of errors stored behind the opaque handle can be accessed using the API dcgmStatusGetCount. The errors are accessed from the opaque handle statusHandle using the API dcgmStatusPopError. The user can invoke dcgmStatusPopError once per error, or until all the errors are fetched.

When the status handle is no longer required, it should be deleted using the API dcgmStatusDestroy.
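
The create, count, pop, destroy lifecycle can be sketched as follows. The stand-in section is a fake engine with a tiny in-memory error queue so the example compiles without DCGM; the dcgmErrorInfo_t layout shown is a stand-in, and stub_group_operation, STUB_ST_ERROR, the recorded error values, and run_and_report_errors are hypothetical. In real code the errors would be recorded by whichever group-level API accepts the status handle.

```c
#include <stdint.h>
#include <stdio.h>

/* ---- Stand-ins so the sketch compiles without DCGM installed. ---- */
typedef enum {
    DCGM_ST_OK    = 0,
    STUB_ST_ERROR = -1 /* hypothetical error code used only by this stub */
} dcgmReturn_t;
typedef uintptr_t dcgmStatus_t;
typedef struct {
    unsigned int gpuId;
    short fieldId;
    int status;
} dcgmErrorInfo_t; /* stand-in layout */

static dcgmErrorInfo_t stubErrors[4]; /* fake engine state: a small FIFO of errors */
static unsigned int stubHead;
static unsigned int stubTail;

static dcgmReturn_t dcgmStatusCreate(dcgmStatus_t *statusHandle)
{
    stubHead = stubTail = 0;
    *statusHandle       = 1;
    return DCGM_ST_OK;
}

static dcgmReturn_t dcgmStatusDestroy(dcgmStatus_t statusHandle)
{
    (void)statusHandle;
    stubHead = stubTail = 0;
    return DCGM_ST_OK;
}

static dcgmReturn_t dcgmStatusGetCount(dcgmStatus_t statusHandle, unsigned int *count)
{
    (void)statusHandle;
    *count = stubTail - stubHead;
    return DCGM_ST_OK;
}

static dcgmReturn_t dcgmStatusPopError(dcgmStatus_t statusHandle, dcgmErrorInfo_t *pDcgmErrorInfo)
{
    (void)statusHandle;
    if (stubHead == stubTail)
        return STUB_ST_ERROR; /* nothing left to pop */
    *pDcgmErrorInfo = stubErrors[stubHead++];
    return DCGM_ST_OK;
}

/* Hypothetical group-level operation that fails on two GPUs and records why. */
static void stub_group_operation(dcgmStatus_t statusHandle)
{
    (void)statusHandle;
    stubErrors[stubTail++] = (dcgmErrorInfo_t){ 0, 150, -1 };
    stubErrors[stubTail++] = (dcgmErrorInfo_t){ 3, 155, -1 };
}
/* ---- End of stand-ins. ---- */

/* Runs one group operation, prints each recorded error, and cleans up.
 * Returns the number of errors reported, or -1 if the handle cannot be created. */
int run_and_report_errors(void)
{
    dcgmStatus_t status;
    unsigned int count = 0;
    int reported       = 0;

    if (dcgmStatusCreate(&status) != DCGM_ST_OK)
        return -1;

    stub_group_operation(status);

    if (dcgmStatusGetCount(status, &count) == DCGM_ST_OK) {
        for (unsigned int i = 0; i < count; i++) {
            dcgmErrorInfo_t err;
            if (dcgmStatusPopError(status, &err) != DCGM_ST_OK)
                break;
            printf("gpu %u field %d status %d\n", err.gpuId, (int)err.fieldId, err.status);
            reported++;
        }
    }

    dcgmStatusDestroy(status);
    return reported;
}
```

Reading the count first and popping that many times is one of the two approaches the text mentions; popping until the call stops succeeding would be the other.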

Parameters:

statusHandle – OUT: Reference to handle for list of statuses

Returns:

dcgmReturn_t dcgmStatusDestroy(dcgmStatus_t statusHandle)

Used to destroy a status handle created using dcgmStatusCreate.

Parameters:

statusHandle – IN: Handle to list of statuses

Returns:

dcgmReturn_t dcgmStatusGetCount(dcgmStatus_t statusHandle, unsigned int *count)

Used to get the count of error entries stored inside the opaque handle statusHandle.

Parameters:
  • statusHandle – IN: Handle to list of statuses

  • count – OUT: Number of error entries present in the list of statuses

Returns:

dcgmReturn_t dcgmStatusPopError(dcgmStatus_t statusHandle, dcgmErrorInfo_t *pDcgmErrorInfo)

Used to iterate through the list of errors maintained behind statusHandle.

The method pops the first error from the list of DCGM statuses. In order to iterate through all the errors, the user can invoke this API for the number of errors or until all the errors are fetched.

Parameters:
  • statusHandle – IN: Handle to list of statuses

  • pDcgmErrorInfo – OUT: First error from the list of statuses

Returns:

dcgmReturn_t dcgmStatusClear(dcgmStatus_t statusHandle)

Used to clear all the errors in the status handle created by the API dcgmStatusCreate.

After one set of operations, the statusHandle can be cleared and reused for the next set of operations.

Parameters:

statusHandle – IN: Handle to list of statuses

Returns: