1.2.1. Discovery

[System]

The following APIs are used to discover GPUs and their attributes on a Node.

Functions

dcgmReturn_t dcgmGetAllDevices ( dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES], int* count )
dcgmReturn_t dcgmGetAllSupportedDevices ( dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES], int* count )
dcgmReturn_t dcgmGetDeviceAttributes ( dcgmHandle_t pDcgmHandle, unsigned int gpuId, dcgmDeviceAttributes_t* pDcgmAttr )
dcgmReturn_t dcgmGetEntityGroupEntities ( dcgmHandle_t dcgmHandle, dcgm_field_entity_group_t entityGroup, dcgm_field_eid_t* entities, int* numEntities, unsigned int flags )
dcgmReturn_t dcgmGetGpuInstanceHierarchy ( dcgmHandle_t dcgmHandle, dcgmMigHierarchy_v2* hierarchy )
dcgmReturn_t dcgmGetNvLinkLinkStatus ( dcgmHandle_t dcgmHandle, dcgmNvLinkStatus_v2* linkStatus )

Functions

dcgmReturn_t dcgmGetAllDevices ( dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES], int* count )
Parameters
pDcgmHandle
IN: DCGM Handle
gpuIdList
OUT: Array to be filled with the GPU Ids present on the system.
count
OUT: Number of GPUs returned in gpuIdList.
Returns

Description

Gets the identifiers of all devices on the system. Each identifier is the DCGM GPU Id of one GPU on the system and is immutable for the lifespan of the engine. Query the list again if the engine is restarted.

The list returned by this function includes gpuIds of GPUs that are not supported by DCGM. To get only the gpuIds of GPUs that are supported by DCGM, use dcgmGetAllSupportedDevices().
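
The calling pattern can be sketched as follows. This is a minimal illustration, not a definitive implementation: so that it compiles without DCGM installed, it declares placeholder stand-ins for dcgmHandle_t, dcgmReturn_t, DCGM_ST_OK and DCGM_MAX_NUM_DEVICES, plus a fake dcgmGetAllDevices() that pretends the node has three GPUs. The helper list_gpu_ids() is a hypothetical name. In real code the stand-ins are replaced by including dcgm_agent.h and linking against the DCGM library.

```c
#include <assert.h>
#include <stdio.h>

/* --- Stand-ins so the sketch compiles without DCGM installed. ---
 * In real code, delete this section and include <dcgm_agent.h>. */
#define DCGM_MAX_NUM_DEVICES 32     /* placeholder value for illustration */
#define DCGM_ST_OK 0                /* placeholder success code */
typedef unsigned long dcgmHandle_t; /* placeholder type */
typedef int dcgmReturn_t;           /* placeholder type */

/* Fake engine call: pretends the node has three GPUs with ids 0, 1, 2. */
dcgmReturn_t dcgmGetAllDevices(dcgmHandle_t pDcgmHandle,
                               unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES],
                               int *count)
{
    (void)pDcgmHandle;
    for (unsigned int i = 0; i < 3; i++)
        gpuIdList[i] = i;
    *count = 3;
    return DCGM_ST_OK;
}
/* --- End of stand-ins. --- */

/* Prints every GPU id on the node. Returns the GPU count, or -1 on error. */
int list_gpu_ids(dcgmHandle_t handle)
{
    unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES]; /* OUT: filled by the call */
    int count = 0;                                /* OUT: entries written */

    if (dcgmGetAllDevices(handle, gpuIdList, &count) != DCGM_ST_OK)
        return -1;

    assert(count >= 0 && count <= DCGM_MAX_NUM_DEVICES);
    for (int i = 0; i < count; i++)
        printf("GPU id: %u\n", gpuIdList[i]);
    return count;
}
```

Because the ids are immutable only for the lifespan of the engine, a caller that caches the list would repeat this query after an engine restart.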

dcgmReturn_t dcgmGetAllSupportedDevices ( dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES], int* count )
Parameters
pDcgmHandle
IN: DCGM Handle
gpuIdList
OUT: Array to be filled with the GPU Ids of the DCGM-supported GPUs present on the system.
count
OUT: Number of GPUs returned in gpuIdList.
Returns

Description

Gets the identifiers of all DCGM-supported devices on the system. Each identifier is the DCGM GPU Id of one supported GPU on the system and is immutable for the lifespan of the engine. Query the list again if the engine is restarted.

The list returned by this function includes ONLY gpuIds of GPUs that are supported by DCGM. To get the gpuIds of all GPUs in the system, use dcgmGetAllDevices().
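
One use of having both lists is to work out which GPUs DCGM does not support. The sketch below is a plain helper that needs no DCGM headers: it takes the two id arrays as they would be returned by dcgmGetAllDevices() and dcgmGetAllSupportedDevices() and writes out the ids found only in the first. The function name unsupported_gpu_ids() and its parameters are hypothetical and chosen for illustration.

```c
#include <assert.h>

/* Copies into `out` every id in `all` that is absent from `supported`.
 * Returns the number of ids written. `out` must hold at least nAll entries. */
int unsupported_gpu_ids(const unsigned int *all, int nAll,
                        const unsigned int *supported, int nSupported,
                        unsigned int *out)
{
    int nOut = 0;

    assert(nAll >= 0 && nSupported >= 0);
    for (int i = 0; i < nAll; i++) {
        int found = 0;
        for (int j = 0; j < nSupported; j++) {
            if (all[i] == supported[j]) {
                found = 1;
                break;
            }
        }
        if (!found)
            out[nOut++] = all[i];
    }
    return nOut;
}
```

The quadratic scan is adequate here because both lists are bounded by DCGM_MAX_NUM_DEVICES.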

dcgmReturn_t dcgmGetDeviceAttributes ( dcgmHandle_t pDcgmHandle, unsigned int gpuId, dcgmDeviceAttributes_t* pDcgmAttr )
Parameters
pDcgmHandle
IN: DCGM Handle
gpuId
IN: GPU Id for which the attributes should be fetched.
pDcgmAttr
IN/OUT: Device attributes corresponding to gpuId. pDcgmAttr->version should be set to dcgmDeviceAttributes_version before this call.
Returns

Description

Gets the device attributes corresponding to gpuId. If the operation is not successful for any of the requested fields, that field is populated with one of the DCGM_BLANK_VALUES defined in dcgm_structs.h.
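
The two points a caller has to handle, setting the version before the call and checking for blank values afterwards, can be sketched as below. Everything DCGM-specific in this block is a stand-in so that it compiles on its own: the struct has a single hypothetical member demoField (not a real member of dcgmDeviceAttributes_t), DEMO_BLANK is a hypothetical sentinel standing in for one of the DCGM_BLANK_VALUES, and the fake dcgmGetDeviceAttributes() only mimics the version check. Real code would include dcgm_agent.h and dcgm_structs.h instead.

```c
#include <assert.h>
#include <string.h>

/* --- Stand-ins so the sketch compiles without DCGM installed. --- */
#define DCGM_ST_OK 0                    /* placeholder success code */
#define DEMO_ST_ERROR (-1)              /* hypothetical error code */
#define DEMO_BLANK 0x7ffffff0           /* hypothetical blank sentinel */
#define dcgmDeviceAttributes_version 1u /* placeholder value */
typedef unsigned long dcgmHandle_t;     /* placeholder type */
typedef int dcgmReturn_t;               /* placeholder type */
typedef struct {
    unsigned int version;
    int demoField; /* hypothetical attribute, not a real member */
} dcgmDeviceAttributes_t;

/* Fake engine call: rejects a wrong version; only gpuId 0 has a value. */
dcgmReturn_t dcgmGetDeviceAttributes(dcgmHandle_t pDcgmHandle,
                                     unsigned int gpuId,
                                     dcgmDeviceAttributes_t *pDcgmAttr)
{
    (void)pDcgmHandle;
    if (pDcgmAttr->version != dcgmDeviceAttributes_version)
        return DEMO_ST_ERROR;
    pDcgmAttr->demoField = (gpuId == 0) ? 90 : DEMO_BLANK;
    return DCGM_ST_OK;
}
/* --- End of stand-ins. --- */

/* Returns 0 and stores the field in *value, 1 if the field came back
 * blank, or -1 if the call itself failed. */
int read_demo_field(dcgmHandle_t handle, unsigned int gpuId, int *value)
{
    dcgmDeviceAttributes_t attr;

    assert(value != NULL);
    memset(&attr, 0, sizeof(attr));
    attr.version = dcgmDeviceAttributes_version; /* must precede the call */

    if (dcgmGetDeviceAttributes(handle, gpuId, &attr) != DCGM_ST_OK)
        return -1;
    if (attr.demoField == DEMO_BLANK)
        return 1; /* field could not be retrieved */
    *value = attr.demoField;
    return 0;
}
```

Keeping "call failed" and "field is blank" as separate results matters because a successful return does not guarantee that every field was populated.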

dcgmReturn_t dcgmGetEntityGroupEntities ( dcgmHandle_t dcgmHandle, dcgm_field_entity_group_t entityGroup, dcgm_field_eid_t* entities, int* numEntities, unsigned int flags )
Parameters
dcgmHandle
IN: DCGM Handle
entityGroup
IN: Entity group to list entities of
entities
OUT: Array of entities for entityGroup
numEntities
IN/OUT: Upon calling, this should be the number of entities that entities[] can hold. Upon return, this will contain the number of entities actually saved to entities[].
flags
IN: Flags to modify the behavior of this request. See DCGM_GEGE_FLAG_* defines in dcgm_structs.h
Returns

Description

Gets the list of entities that exist for a given entity group. This API can be used in place of dcgmGetAllDevices.
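
The IN/OUT use of numEntities is the part most easily gotten wrong, so a short sketch may help. It assumes placeholder stand-ins for the DCGM types, a hypothetical group value DEMO_GROUP_GPU, and a fake dcgmGetEntityGroupEntities() that holds four entities and returns a hypothetical error when the buffer is too small (how the real call reports that case is not stated here). Passing 0 for flags is likewise an assumption meaning "no flags". list_entities() is a hypothetical wrapper name.

```c
#include <assert.h>

/* --- Stand-ins so the sketch compiles without DCGM installed. --- */
#define DCGM_ST_OK 0                   /* placeholder success code */
#define DEMO_ST_TOO_SMALL (-1)         /* hypothetical error code */
#define DEMO_GROUP_GPU 1               /* hypothetical entity group value */
typedef unsigned long dcgmHandle_t;    /* placeholder type */
typedef int dcgmReturn_t;              /* placeholder type */
typedef int dcgm_field_entity_group_t; /* placeholder type */
typedef unsigned int dcgm_field_eid_t; /* placeholder type */

/* Fake engine call: the group always contains entities 10, 11, 12, 13. */
dcgmReturn_t dcgmGetEntityGroupEntities(dcgmHandle_t dcgmHandle,
                                        dcgm_field_entity_group_t entityGroup,
                                        dcgm_field_eid_t *entities,
                                        int *numEntities, unsigned int flags)
{
    (void)dcgmHandle;
    (void)entityGroup;
    (void)flags;
    if (*numEntities < 4)
        return DEMO_ST_TOO_SMALL; /* behavior of this fake only */
    for (int i = 0; i < 4; i++)
        entities[i] = 10u + (unsigned int)i;
    *numEntities = 4;
    return DCGM_ST_OK;
}
/* --- End of stand-ins. --- */

/* Fills `entities` (which has `capacity` slots) and returns the number of
 * entities stored, or -1 on error. */
int list_entities(dcgmHandle_t handle, dcgm_field_entity_group_t group,
                  dcgm_field_eid_t *entities, int capacity)
{
    int numEntities = capacity; /* IN: how many slots the buffer has */

    assert(entities != 0 && capacity > 0);
    if (dcgmGetEntityGroupEntities(handle, group, entities,
                                   &numEntities, 0) != DCGM_ST_OK)
        return -1;
    return numEntities;         /* OUT: how many slots were filled */
}
```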

dcgmReturn_t dcgmGetGpuInstanceHierarchy ( dcgmHandle_t dcgmHandle, dcgmMigHierarchy_v2* hierarchy )
Parameters
dcgmHandle
IN: DCGM Handle
hierarchy
Returns

Description

Gets the hierarchy of GPUs, GPU Instances, and Compute Instances by populating a list in which each entity carries a reference to its parent.
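
A flat list with parent references is typically consumed by walking upward from an entity until a GPU is reached. The sketch below illustrates that walk on a deliberately simplified, hypothetical model: entity_ref, hierarchy_entry and owning_gpu() are illustration-only names and do not reflect the actual layout of dcgmMigHierarchy_v2, which should be taken from dcgm_structs.h.

```c
#include <assert.h>

/* Hypothetical simplified model of a parent-linked hierarchy. */
typedef enum { KIND_GPU, KIND_GPU_INSTANCE, KIND_COMPUTE_INSTANCE } entity_kind;
typedef struct { entity_kind kind; unsigned int id; } entity_ref;
typedef struct { entity_ref entity; entity_ref parent; } hierarchy_entry;

/* Follows parent references from `start` until a GPU is reached.
 * Returns 0 and stores the GPU id in *gpuId, or -1 if `start` (or one of
 * its ancestors) has no entry in the list. */
int owning_gpu(const hierarchy_entry *list, int n, entity_ref start,
               unsigned int *gpuId)
{
    entity_ref cur = start;

    assert(gpuId != 0);
    /* At most n hops are needed; the bound also guards against cycles. */
    for (int hops = 0; hops <= n; hops++) {
        int found = 0;

        if (cur.kind == KIND_GPU) {
            *gpuId = cur.id;
            return 0;
        }
        for (int i = 0; i < n; i++) {
            if (list[i].entity.kind == cur.kind &&
                list[i].entity.id == cur.id) {
                cur = list[i].parent; /* step one level up */
                found = 1;
                break;
            }
        }
        if (!found)
            return -1;
    }
    return -1;
}
```

For example, with a GPU Instance whose parent is GPU 2 and a Compute Instance whose parent is that GPU Instance, starting from the Compute Instance yields GPU id 2 after two hops.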

dcgmReturn_t dcgmGetNvLinkLinkStatus ( dcgmHandle_t dcgmHandle, dcgmNvLinkStatus_v2* linkStatus )
Parameters
dcgmHandle
IN: DCGM Handle
linkStatus
OUT: Structure in which to store NvLink link statuses. linkStatus->version should be set to dcgmNvLinkStatus_version2 (the version matching dcgmNvLinkStatus_v2) before calling this.
Returns

Description

Gets the NvLink link status for every NvLink in this system. This includes the NvLinks of both GPUs and NvSwitches. Note that only the NvSwitches and GPUs that are visible to the current environment are returned in this structure.
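
Once the per-link states are available, a common next step is to tally them, for instance to count how many links are up. The sketch below shows such a tally on a hypothetical simplified model: link_state, device_links, DEMO_LINKS_PER_DEVICE and count_links_in_state() are illustration-only names and are not the members or enumerators of dcgmNvLinkStatus_v2, which should be taken from dcgm_structs.h.

```c
#include <assert.h>

/* Hypothetical simplified model of per-device link states. */
typedef enum {
    LINK_NOT_SUPPORTED,
    LINK_DISABLED,
    LINK_DOWN,
    LINK_UP
} link_state;

#define DEMO_LINKS_PER_DEVICE 4 /* hypothetical fixed link count */

typedef struct {
    unsigned int entityId;                   /* GPU or NvSwitch id */
    link_state links[DEMO_LINKS_PER_DEVICE]; /* one state per link */
} device_links;

/* Counts, across all devices, the links whose state equals `wanted`. */
int count_links_in_state(const device_links *devices, int numDevices,
                         link_state wanted)
{
    int total = 0;

    assert(numDevices >= 0);
    for (int d = 0; d < numDevices; d++) {
        for (int l = 0; l < DEMO_LINKS_PER_DEVICE; l++) {
            if (devices[d].links[l] == wanted)
                total++;
        }
    }
    return total;
}
```

Since only devices visible to the current environment are reported, a tally like this describes the visible part of the system and not necessarily the whole node.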